Introduction to GBLUP

Genomic BLUP (GBLUP) is a simple way to utilize genotypes to create a new relationship among animals. Some call this matrix the ‘realized’ relationship matrix (G) in place of the ‘expected’ relationship matrix (A). This is because the A matrix cannot exploit the Mendelian sampling term when offspring inherit chromosome segments from the parents.

\[ EBV_{offspring} = \frac{1}{2}EBV_{sire} + \frac{1}{2}EBV_{dam} + \phi \]

Without phenotyping or genomic information (think of a young bull calf), the equation becomes

\[ EBV_{offspring} = \frac{1}{2}EBV_{sire} + \frac{1}{2}EBV_{dam} \]

An example can be seen in the following histogram from (Lourenco et al. 2015) showing observed vs expected relationships among full-sibs. The expected relationship for full-sibs is 0.5 (non-inbred). Genomic relationships allow us to exploit the Mendelian sampling portion (\(\phi\))of the breeding value equation from parents. This can separate full-sibs prior to having any phenotypes. People often forget however, that phenotyping can exploit at least part of the Mendelian sampling term.

Think of two parents both being Aa for a major QTL. If you cross Aa x Aa parents, the offspring will be aa, Aa, and AA. Prior to genotyping the offspring, there would be no way to separate offspring due to this major QTL. Genotyping allows us to observe this characteristic directly.

Example: For instance, let us say that a boar (male pig) is heterozygous at a large QTL for meat quality (Aa). Comparing two full-sib piglets, the \(\phi\) term would be negative if the pig received the bad recessive stress gene (a) from the sire while the \(\phi\) term would be positive (creating a better EBV) for the other full-sib that received the A allele. These two full-sibs would be less related based on genomic information, but equally related without known genotypes.

We will be using an example from Mrode’s 3rd Edition of Linear Models for the Prediction of Animal Breeding Values. This book is very excellent because it has examples that you can follow along and program yourself if you don’t understand.

Model

The simplest animal model is in the following form:

\[ y = Xb + Za + e \]

Where:

\(y\) = vector of phenotypes

\(b\) = fixed effect estimates

\(a\) = vector of random breeding values

\(e\) = residuals

\(X\),\(Z\) = incidence matrices for fixed and random (animal) effects, respectively

See Mrode’s Chapter 3 for information on this model if you are not familiar.

MME

And the cooresponding MME are typically shown as:

\[ \left [ \begin{array}{cl} \boldsymbol{X'X} & \boldsymbol{X'Z} \\ \boldsymbol{Z'X} & \boldsymbol{Z'Z} + \boldsymbol{A}^{-1} \lambda \\ \end{array} \right ] \left [ \begin{array}{c} \boldsymbol{b} \\ \boldsymbol{a} \end{array} \right ] = \left [ \begin{array}{c} \boldsymbol{X'y} \\ \boldsymbol{Z'y} \end{array} \right ] \]

where:

\(\lambda\) = \(\frac{\sigma^{2}_{e}}{\sigma^{2}_{a}}\)

and the rest has been described above.

GBLUP

GBLUP is simply an extension of the MME where we replace the expected relationship matrix (A) with the realized relationship matrix (G) in the MME.

\[ \left [ \begin{array}{cl} \boldsymbol{X'X} & \boldsymbol{X'Z} \\ \boldsymbol{Z'X} & \boldsymbol{Z'Z} + \boldsymbol{G}^{-1} \lambda \\ \end{array} \right ] \left [ \begin{array}{c} \boldsymbol{b} \\ \boldsymbol{a} \end{array} \right ] = \left [ \begin{array}{c} \boldsymbol{X'y} \\ \boldsymbol{Z'y} \end{array} \right ] \]

Notice, the ONLY thing that changed was the G matrix in place of the A matrix. One issue can be, depending on the way you derive G, the variances calculated from REML or AIREML can be quite different (see Forni 2011 (Forni, Aguilar, and Misztal 2011)).

Example

The following example is from Chapter 11 from Mrode’s Linear Models book.

##    animal sire dam mean EDC fat_DYD SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9 SNP10
## 1      13    0   0    1 558     9.0    2    0    1    1    0    0    0    2    1     2
## 2      14    0   0    1 722    13.4    1    0    0    0    0    2    0    2    1     0
## 3      15   13   4    1 300    12.7    1    1    2    1    1    0    0    2    1     2
## 4      16   15   2    1  73    15.4    0    0    2    1    0    1    0    2    2     1
## 5      17   15   5    1  52     5.9    0    1    1    2    0    0    0    2    1     2
## 6      18   14   6    1  87     7.7    1    1    0    1    0    2    0    2    2     1
## 7      19   14   9    1  64    10.2    0    0    1    1    0    2    0    2    2     0
## 8      20   14   9    1 103     4.8    0    1    1    0    0    1    0    2    2     0
## 9      21    1   3    1  13     7.6    2    0    0    0    0    1    2    2    1     2
## 10     22   14   8    1 125     8.8    0    0    0    1    1    2    0    2    0     0
## 11     23   14  11    1  93     9.8    0    1    1    0    0    1    0    2    2     1
## 12     24   14  10    1  66     9.2    1    0    0    0    1    1    0    2    0     0
## 13     25   14   7    1  75    11.5    0    0    0    1    1    2    0    2    1     0
## 14     26   14  12    1  33    13.3    1    0    1    1    0    2    0    1    0     0

The pedigree:

##    animal sire dam
## 1       1    0   0
## 2       2    0   0
## 3       3    0   0
## 4       4    0   0
## 5       5    0   0
## 6       6    0   0
## 7       7    0   0
## 8       8    0   0
## 9       9    0   0
## 10     10    0   0
## 11     11    0   0
## 12     12    0   0
## 13     13    0   0
## 14     14    0   0
## 15     15   13   4
## 16     16   15   2
## 17     17   15   5
## 18     18   14   6
## 19     19   14   9
## 20     20   14   9
## 21     21    1   3
## 22     22   14   8
## 23     23   14  11
## 24     24   14  10
## 25     25   14   7
## 26     26   14  12

Processing the marker matrix (M)

First you need to extract the M matrix (marker genotype matrix).

# Extract M (convert to Matrix type)
M <- Matrix(as.matrix(data.11.1[, 7:16]))

# add rownames of animals to M
rownames(M) <- data.11.1$animal

# print M
print(M)

## 14 x 10 Matrix of class "dgeMatrix"
##    SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9 SNP10
## 13    2    0    1    1    0    0    0    2    1     2
## 14    1    0    0    0    0    2    0    2    1     0
## 15    1    1    2    1    1    0    0    2    1     2
## 16    0    0    2    1    0    1    0    2    2     1
## 17    0    1    1    2    0    0    0    2    1     2
## 18    1    1    0    1    0    2    0    2    2     1
## 19    0    0    1    1    0    2    0    2    2     0
## 20    0    1    1    0    0    1    0    2    2     0
## 21    2    0    0    0    0    1    2    2    1     2
## 22    0    0    0    1    1    2    0    2    0     0
## 23    0    1    1    0    0    1    0    2    2     1
## 24    1    0    0    0    1    1    0    2    0     0
## 25    0    0    0    1    1    2    0    2    1     0
## 26    1    0    1    1    0    2    0    1    0     0

You should always check a few major things!!

Quality of SNP and Animal genotypes (i.e. Call Rates!). Default = 90%
Minor Allele Frequency (MAF) > 1% or 5%

# call rate of animals
1 - (apply(M, 1, function(x) sum(x==5)) / nrow(M))

## 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
##  1  1  1  1  1  1  1  1  1  1  1  1  1  1

# call rate of SNPs
1 - (apply(M, 2, function(x) sum(x==5)) / ncol(M))

##  SNP1  SNP2  SNP3  SNP4  SNP5  SNP6  SNP7  SNP8  SNP9 SNP10 
##     1     1     1     1     1     1     1     1     1     1

In this case we have 0% missing because it’s an example, but because of genotyping and SNP issues, many SNP/animals can fail this stage. Tissue samples can be poor or a sample can get contaminated. Some SNP just don’t anneal well to the panel. These SNP are typically replaced in newer panels for better SNP.

Next, check allele frequencies. We will cut of at 1%. This would remove anything with a frequency below 0.01 or above 0.99. For us, it removes no SNPs.

# allele frequencies
p <- apply(M, 2, mean)/2
round(p, 3)

##  SNP1  SNP2  SNP3  SNP4  SNP5  SNP6  SNP7  SNP8  SNP9 SNP10 
## 0.321 0.179 0.357 0.357 0.143 0.607 0.071 0.964 0.571 0.393

The following are the allele frequencies. This is typically ‘U shaped’ for closed populations under selection.

# histogram of allele frequencies
ggplot(data.p, aes(x=p)) +
  geom_histogram(bins=20, color="white") +
  ggtitle("Histogram of allele frequencies (p)") +
  xlab("Allele Frequencies") +
  theme_minimal()

Create the G matrix

First we need to center the M matrix using the means (2 times the allele frequencies).

# create P matrix
P <- matrix(rep(p*2, nrow(M)), ncol=ncol(M), nrow=nrow(M), byrow=TRUE)
rownames(P) <- rownames(M)
colnames(P) <- colnames(M)
print(P)

##         SNP1      SNP2      SNP3      SNP4      SNP5     SNP6      SNP7     SNP8     SNP9     SNP10
## 13 0.6428571 0.3571429 0.7142857 0.7142857 0.2857143 1.214286 0.1428571 1.928571 1.142857 0.7857143
## 14 0.6428571 0.3571429 0.7142857 0.7142857 0.2857143 1.214286 0.1428571 1.928571 1.142857 0.7857143
## 15 0.6428571 0.3571429 0.7142857 0.7142857 0.2857143 1.214286 0.1428571 1.928571 1.142857 0.7857143
## 16 0.6428571 0.3571429 0.7142857 0.7142857 0.2857143 1.214286 0.1428571 1.928571 1.142857 0.7857143
## 17 0.6428571 0.3571429 0.7142857 0.7142857 0.2857143 1.214286 0.1428571 1.928571 1.142857 0.7857143
## 18 0.6428571 0.3571429 0.7142857 0.7142857 0.2857143 1.214286 0.1428571 1.928571 1.142857 0.7857143
## 19 0.6428571 0.3571429 0.7142857 0.7142857 0.2857143 1.214286 0.1428571 1.928571 1.142857 0.7857143
## 20 0.6428571 0.3571429 0.7142857 0.7142857 0.2857143 1.214286 0.1428571 1.928571 1.142857 0.7857143
## 21 0.6428571 0.3571429 0.7142857 0.7142857 0.2857143 1.214286 0.1428571 1.928571 1.142857 0.7857143
## 22 0.6428571 0.3571429 0.7142857 0.7142857 0.2857143 1.214286 0.1428571 1.928571 1.142857 0.7857143
## 23 0.6428571 0.3571429 0.7142857 0.7142857 0.2857143 1.214286 0.1428571 1.928571 1.142857 0.7857143
## 24 0.6428571 0.3571429 0.7142857 0.7142857 0.2857143 1.214286 0.1428571 1.928571 1.142857 0.7857143
## 25 0.6428571 0.3571429 0.7142857 0.7142857 0.2857143 1.214286 0.1428571 1.928571 1.142857 0.7857143
## 26 0.6428571 0.3571429 0.7142857 0.7142857 0.2857143 1.214286 0.1428571 1.928571 1.142857 0.7857143

Create Z by subtracting P from M.

# create Z matrix
Z <- M - P
print(Z)

## 14 x 10 Matrix of class "dgeMatrix"
##          SNP1       SNP2       SNP3       SNP4       SNP5       SNP6       SNP7        SNP8
## 13  1.3571429 -0.3571429  0.2857143  0.2857143 -0.2857143 -1.2142857 -0.1428571  0.07142857
## 14  0.3571429 -0.3571429 -0.7142857 -0.7142857 -0.2857143  0.7857143 -0.1428571  0.07142857
## 15  0.3571429  0.6428571  1.2857143  0.2857143  0.7142857 -1.2142857 -0.1428571  0.07142857
## 16 -0.6428571 -0.3571429  1.2857143  0.2857143 -0.2857143 -0.2142857 -0.1428571  0.07142857
## 17 -0.6428571  0.6428571  0.2857143  1.2857143 -0.2857143 -1.2142857 -0.1428571  0.07142857
## 18  0.3571429  0.6428571 -0.7142857  0.2857143 -0.2857143  0.7857143 -0.1428571  0.07142857
## 19 -0.6428571 -0.3571429  0.2857143  0.2857143 -0.2857143  0.7857143 -0.1428571  0.07142857
## 20 -0.6428571  0.6428571  0.2857143 -0.7142857 -0.2857143 -0.2142857 -0.1428571  0.07142857
## 21  1.3571429 -0.3571429 -0.7142857 -0.7142857 -0.2857143 -0.2142857  1.8571429  0.07142857
## 22 -0.6428571 -0.3571429 -0.7142857  0.2857143  0.7142857  0.7857143 -0.1428571  0.07142857
## 23 -0.6428571  0.6428571  0.2857143 -0.7142857 -0.2857143 -0.2142857 -0.1428571  0.07142857
## 24  0.3571429 -0.3571429 -0.7142857 -0.7142857  0.7142857 -0.2142857 -0.1428571  0.07142857
## 25 -0.6428571 -0.3571429 -0.7142857  0.2857143  0.7142857  0.7857143 -0.1428571  0.07142857
## 26  0.3571429 -0.3571429  0.2857143  0.2857143 -0.2857143  0.7857143 -0.1428571 -0.92857143
##          SNP9      SNP10
## 13 -0.1428571  1.2142857
## 14 -0.1428571 -0.7857143
## 15 -0.1428571  1.2142857
## 16  0.8571429  0.2142857
## 17 -0.1428571  1.2142857
## 18  0.8571429  0.2142857
## 19  0.8571429 -0.7857143
## 20  0.8571429 -0.7857143
## 21 -0.1428571  1.2142857
## 22 -1.1428571 -0.7857143
## 23  0.8571429  0.2142857
## 24 -1.1428571 -0.7857143
## 25 -0.1428571 -0.7857143
## 26 -1.1428571 -0.7857143

Then we need to get the scaling factor (sum2pq)

# sum 2pq to scale G to the A matrix
q <- 1 - p
sum2pq <- 2*sum(p*q)  # note you can pull out the 2 (redundant)
print(sum2pq)

## [1] 3.538265

NOTE: Remember the equation for the variance of a binomial in math stats class?

\[ Var(Binomial) = n\theta(1-\theta) \]

Which looks a whole lot like

\[ Var(SNP) = 2p(1-p) \]

\[ Var(SNP) = 2pq \]

Finally, we can create the G matrix

# calculate G!
G <- (Z %*% t(Z)) / sum2pq
print(round(G, 3))

## 14 x 14 Matrix of class "dgeMatrix"
##        13     14     15     16     17     18     19     20     21     22     23     24     25
## 13  1.472 -0.446  0.988  0.059  0.685 -0.163 -0.708 -0.547  0.887 -0.789 -0.203 -0.143 -0.829
## 14 -0.446  0.745 -0.930 -0.446 -0.950  0.180  0.200  0.079  0.099  0.402 -0.143  0.483  0.362
## 15  0.988 -0.930  1.634  0.422  1.048 -0.365 -0.627 -0.183  0.120 -0.708  0.160 -0.345 -0.748
## 16  0.059 -0.446  0.422  0.907  0.402 -0.163  0.422  0.301 -0.526 -0.506  0.362 -0.708 -0.264
## 17  0.685 -0.950  1.048  0.402  1.593 -0.102 -0.365 -0.203 -0.183 -0.446  0.140 -0.647 -0.486
## 18 -0.163  0.180 -0.365 -0.163 -0.102  0.745  0.200  0.079  0.099 -0.163  0.140 -0.365  0.079
## 19 -0.708  0.200 -0.627  0.422 -0.365  0.200  0.786  0.382 -0.728  0.140  0.160 -0.345  0.382
## 20 -0.547  0.079 -0.183  0.301 -0.203  0.079  0.382  0.826 -0.567 -0.264  0.604 -0.183 -0.022
## 21  0.887  0.099  0.120 -0.526 -0.183  0.099 -0.728 -0.567  2.280 -0.526 -0.224  0.120 -0.567
## 22 -0.789  0.402 -0.708 -0.506 -0.446 -0.163  0.140 -0.264 -0.526  1.190 -0.486  0.705  0.867
## 23 -0.203 -0.143  0.160  0.362  0.140  0.140  0.160  0.604 -0.224 -0.486  0.665 -0.405 -0.244
## 24 -0.143  0.483 -0.345 -0.708 -0.647 -0.365 -0.345 -0.183  0.120  0.705 -0.405  1.068  0.382
## 25 -0.829  0.362 -0.748 -0.264 -0.486  0.079  0.382 -0.022 -0.567  0.867 -0.244  0.382  0.826
## 26 -0.264  0.362 -0.466 -0.264 -0.486 -0.203  0.099 -0.304 -0.284  0.584 -0.526  0.382  0.261
##        26
## 13 -0.264
## 14  0.362
## 15 -0.466
## 16 -0.264
## 17 -0.486
## 18 -0.203
## 19  0.099
## 20 -0.304
## 21 -0.284
## 22  0.584
## 23 -0.526
## 24  0.382
## 25  0.261
## 26  1.109

So the final equation for G is:

\[ G = \frac{ZZ'}{2\sum{pq}} \]

G is a (co)variance matrix, which means that the off-diagonals still coorespond to the numerator (or covariance) part (thus the numerator relationship matrix = A). The diagonals are the variances. Relationships can be calculated with the normal equation for a correlation:

\[ cor_{ij} = \frac{cov_{ij}}{\sqrt{var_{i}var_{j}}} \]

R is having a problem since G is not positive definite. You would use the cov2cor function in R.

# Convert (co)variance matrix to a correlation matrix for relationships
# cov2cor(G)

Condition of the matrix

The condition of the matrix represents how much redundant information there is in the marker covariates. We typically refer to this as collinearity or a problem with confounding. The G matrix is typically poorly conditioned. First we need to check the eigenvalues of the G matrix.

# eigenvalues
round(eigen(G)$values, 3)

##  [1] 6.035 3.822 2.323 1.136 1.054 0.609 0.425 0.337 0.082 0.024 0.000 0.000 0.000 0.000

Check the condition of the G matrix. Here is the condition number (largest divided by smallest eigenvalue)

# condition
kappa(G)

## [1] 4.562969e+18

Notice this is \[ 4.5 * 10^{18} \] and the smallest eigenvalue is close to 0, which means the matrix is basically not positive definite. Notice below, the Lapack library displays the inverse of the condition number.

# Try to solve G for inverse
# solve(G)
 
#Error in solve(G) : 
#  Lapack dgecon(): system computationally singular, reciprocal condition number = 2.0601e-18

Other Matrices

From VanRaden (2008), we can also look at the MM’ and M’M matrices after subtracting 1. When you subtract 1 from M, the heterozygote is now 0. So remember -1*-1=1 (this is easy to forget in think in terms of 0,1,2).

# calculate MM' (n x n , same dimention as G)
MMp <- (M-1) %*% t(M-1)
print(MMp)

## 14 x 14 Matrix of class "dgeMatrix"
##    13 14 15 16 17 18 19 20 21 22 23 24 25 26
## 13  7  2  4  3  4  2  1  1  4  0  2  2  0  1
## 14  2  8 -1  3  0  5  6  5  3  6  4  6  6  5
## 15  4 -1  5  3  4  0  0  1  0 -1  2  0 -1 -1
## 16  3  3  3  7  4  3  6  5  0  2  5  1  3  2
## 17  4  0  4  4  7  2  2  2  0  1  3  0  1  0
## 18  2  5  0  3  2  6  5  4  2  3  4  2  4  2
## 19  1  6  0  6  2  5  8  6  0  5  5  3  6  4
## 20  1  5  1  5  2  4  6  7  0  3  6  3  4  2
## 21  4  3  0  0  0  2  0  0  8  0  1  2  0  0
## 22  0  6 -1  2  1  3  5  3  0  8  2  6  7  5
## 23  2  4  2  5  3  4  5  6  1  2  6  2  3  1
## 24  2  6  0  1  0  2  3  3  2  6  2  7  5  4
## 25  0  6 -1  3  1  4  6  4  0  7  3  5  7  4
## 26  1  5 -1  2  0  2  4  2  0  5  1  4  4  6

This is an n x n matrix. The diagonals count the number of homozygous loci and the off-diagonals measure the number of shared alleles (VanRaden 2008).

# calculate M'M (m x m)
MpM <- t(M-1) %*% (M-1)
print(MpM)

## 10 x 10 Matrix of class "dgeMatrix"
##       SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9 SNP10
## SNP1     9    2    0    0    3   -3    7   -5   -3     5
## SNP2     2    9    4    3    6   -4    7   -8    1     4
## SNP3     0    4    8    3    2   -5    2   -4    2     4
## SNP4     0    3    3    6    3   -2    2   -4   -1     3
## SNP5     3    6    2    3   10   -2    8   -9   -4     1
## SNP6    -3   -4   -5   -2   -2    9   -3    2    0    -8
## SNP7     7    7    2    2    8   -3   14  -11   -2     5
## SNP8    -5   -8   -4   -4   -9    2  -11   13    3    -2
## SNP9    -3    1    2   -1   -4    0   -2    3    8     1
## SNP10    5    4    4    3    1   -8    5   -2    1    11

WARNING: Do not try this at home. Allocating space to a m x m matrix will take up a lot of space. If you are using a typical laptop, it’s likely it doesn’t have enough memory for a real M matrix.

Along the diagonal, it will count the number of homozygous individuals for that SNP. Notice SNP7 is fixed at 0 or 2 for all inviduals (14). The off-diagonals give you an idea of the LD between SNPs. Notice that SNP7 and SNP8 have a -11 value off-diagonal and 0 for SNP7 usually cooresponds to a 2 for SNP8.

Alternative calculations for G

Using different allele frequencies

One simple way to create different G matrices is to center and scale based on different allele frequencies. Please see Forni, Aguilar, and Misztal (2011) for a longer discussion on this topic. We should truly have the allele frequencies from the base population, but so far this has not been possible for animal breeding programs (VanRaden (2008))

Using individual SNP weights (simple)

The equation above is the basic (and most population) way to calculate G. You can also calculate G from many different equations. The second most popular is probably:

\[ G = ZDZ' \]

We need to first create the diagonal D matrix.

# Create D
D <- diag(1 / (ncol(M)*(2*p*q)))

# column and rownames
colnames(D) <- colnames(M)
rownames(D) <- colnames(M)

# print D
print(round(D, 3))

##        SNP1  SNP2  SNP3  SNP4  SNP5 SNP6  SNP7  SNP8  SNP9 SNP10
## SNP1  0.229 0.000 0.000 0.000 0.000 0.00 0.000 0.000 0.000  0.00
## SNP2  0.000 0.341 0.000 0.000 0.000 0.00 0.000 0.000 0.000  0.00
## SNP3  0.000 0.000 0.218 0.000 0.000 0.00 0.000 0.000 0.000  0.00
## SNP4  0.000 0.000 0.000 0.218 0.000 0.00 0.000 0.000 0.000  0.00
## SNP5  0.000 0.000 0.000 0.000 0.408 0.00 0.000 0.000 0.000  0.00
## SNP6  0.000 0.000 0.000 0.000 0.000 0.21 0.000 0.000 0.000  0.00
## SNP7  0.000 0.000 0.000 0.000 0.000 0.00 0.754 0.000 0.000  0.00
## SNP8  0.000 0.000 0.000 0.000 0.000 0.00 0.000 1.452 0.000  0.00
## SNP9  0.000 0.000 0.000 0.000 0.000 0.00 0.000 0.000 0.204  0.00
## SNP10 0.000 0.000 0.000 0.000 0.000 0.00 0.000 0.000 0.000  0.21

Now we can create the G matrix based on these weights.

# calculate G with the alternative formula
G <- Z %*% D %*% t(Z)
print(round(G, 3))

## 14 x 14 Matrix of class "dgeMatrix"
##        13     14     15     16     17     18     19     20     21     22     23     24     25
## 13  1.180 -0.274  0.692  0.081  0.498 -0.108 -0.490 -0.419  0.585 -0.610 -0.165 -0.107 -0.640
## 14 -0.274  0.614 -0.750 -0.293 -0.715  0.143  0.192  0.061 -0.014  0.289 -0.104  0.362  0.260
## 15  0.692 -0.750  1.401  0.270  0.810 -0.243 -0.519 -0.107 -0.120 -0.449  0.147 -0.174 -0.478
## 16  0.081 -0.293  0.270  0.741  0.317 -0.133  0.372  0.232 -0.521 -0.375  0.277 -0.540 -0.200
## 17  0.498 -0.715  0.810  0.317  1.292  0.009 -0.255 -0.061 -0.314 -0.375  0.194 -0.548 -0.404
## 18 -0.108  0.143 -0.243 -0.133  0.009  0.644  0.143  0.135 -0.065 -0.168  0.180 -0.313  0.007
## 19 -0.490  0.192 -0.519  0.372 -0.255  0.143  0.639  0.290 -0.665  0.110  0.125 -0.264  0.285
## 20 -0.419  0.061 -0.107  0.232 -0.061  0.135  0.290  0.710 -0.586 -0.239  0.545 -0.186 -0.064
## 21  0.585 -0.014 -0.120 -0.521 -0.314 -0.065 -0.665 -0.586  3.652 -0.568 -0.332 -0.056 -0.597
## 22 -0.610  0.289 -0.449 -0.375 -0.375 -0.168  0.110 -0.239 -0.568  1.024 -0.404  0.649  0.790
## 23 -0.165 -0.104  0.147  0.277  0.194  0.180  0.125  0.545 -0.332 -0.404  0.590 -0.350 -0.229
## 24 -0.107  0.362 -0.174 -0.540 -0.548 -0.313 -0.264 -0.186 -0.056  0.649 -0.350  0.932  0.416
## 25 -0.640  0.260 -0.478 -0.200 -0.404  0.007  0.285 -0.064 -0.597  0.790 -0.229  0.416  0.761
## 26 -0.224  0.228 -0.482 -0.230 -0.447 -0.229  0.038 -0.311 -0.399  0.325 -0.476  0.180  0.092
##        26
## 13 -0.224
## 14  0.228
## 15 -0.482
## 16 -0.230
## 17 -0.447
## 18 -0.229
## 19  0.038
## 20 -0.311
## 21 -0.399
## 22  0.325
## 23 -0.476
## 24  0.180
## 25  0.092
## 26  1.934

Weighted G Matrix

You can also provide alternative weights to the calculation of the G matrix in the form ZDZ’ (see above). This will account for the non-normal genetic architecture (i.e. larger QTL). This can be done iteratively by backsolving using the Z and G matrices from the MME (Wang et al. 2012). Alternatively, you can solve for marker effects using Bayesian methodology and then plug the new \(\alpha\)’s into the equation for variance \(2pq\alpha^2\). Examples could be Bayesian LASSO or BayesB. What happens is G gives almost equal weight to all markers (besides from allele frequency differences, mentioned by VanRaden 2008) so large QTLs should make individuals with the same genotype more related than an average SNP. Having a larger \(\alpha\) should give it a larger variance and count more in the G matrix. In an extreme example, you can think of a simply inherited trait. This G should have off-diagonal elements of 0, 0.5, or 1 depending on the genotype at the causative locus. Animals with genotypes aa and AA would be unrelated while animals with AA and AA would be very related in a relationship sense for that trait. Details are beyond the scope here, please see Wang et al. (2012) and Tiezzi and Maltecca (2015) for more on this topic.

Single-step GBLUP

Single-step GBLUP is a process of parsimoniously combining phenotypes, genotypes, and pedigree into equation to calculate GEBVs. Previously, genotyped individuals were used to calculate direct genomic values (DGVs) and then was combined with EBVs from traditional evaluations.

Introduction

One issue that immediately popped up when we started genotyping was that we had phenotypes on thousands (possibly millions) of animals, yet we only had genotypes on a select subset of those that may or may not have phenotypes (think Dairy bulls). It also costs money and was very expensive when genomics started (and is still quite expensive on a population basis for all candidates). Early on, Bayesians had the idea to deregress EBVs to obtain a ‘pseudo-phenotype’ for animals with genotypes (see Garrick et al., 2009). For a full discussion, see Legarra et al. (2014) for a discussion on the limitations and problems with pseudo-phenotypes from deregression. Here is a short list adapted from Legarra et al. (2014):

Close relative information is ignored in the genomic prediction
Close relative information is ignored when creating pseudo-phenotypes
Covariances among pseudo-phenotypes are typically not correctly modeled
Key parameters are difficult to obtain (e.g. precision of pseudo-phenotypes)
No feedback from genotyped to pedigree-based evaluations
Animals are selected based on known genotype, creating bias in evaluations.
Very difficult to generalize (e.g. multiple-trait generalization or maternal traits)

Split H into parts:

\[ \textbf{H} = \left [ \begin{array}{cc} \textbf{H}_{11} & \textbf{H}_{12} \\ \textbf{H}_{21} & \textbf{H}_{22} \end{array} \right ] \]

The equation for H is as follows:

\[ \textbf{H} = \left [ \begin{array}{cc} \textbf{A}_{11} + \textbf{A}_{12}\textbf{A}_{22}^{-1}\textbf{A}_{21} + \textbf{A}_{12}\textbf{A}_{22}^{-1}\textbf{G} \textbf{A}_{22}^{-1}\textbf{A}_{21} & \textbf{A}_{12}\textbf{A}_{22}^{-1}\textbf{G} \\ \textbf{G} \textbf{A}_{22}^{-1}\textbf{A}_{21} & \textbf{G} \end{array} \right ] \]

Notice how G comes into play for the \(H_{11}\), \(H_{12}\), and \(H_{21}\) parts of the equation. This indicates that the genotype relationships will change the relationships among non-genotyped and between genotyped and non-genotyped. This is a key feature of the H matrix.

Example

This example will be taken from Legarra et al. (2009) where the original derivation of single-step GBLUP (ssGBLUP) came from. The only difficult part for beginners is learning how to sort the matrices. The genotyped inviduals will be scattered throughout the pedigree.

# pedigree
animal      <- 1:17
sire        <- c(0,0,0,0,0,0,0,0,1,3,5,7,9,11,11,13,13)
dam         <- c(0,0,0,0,0,0,0,0,2,4,6,8,10,12,4,15,14)
genotyped   <- c(9,10,11,12)
  
# get number of genotyped and non-genotyped
n_genotyped     <- length(genotyped)
n_non_genotyped <- length(animal) - length(genotyped)
  
# create pedigree as data frame
ped    <- data.frame(animal, sire, dam)
  
# remove variables
rm(list=c("animal","sire","dam"))
  
# add indicator if genotyped to pedigree
ped$genotyped <- ped$animal %in% genotyped
  
# print pedigree
print(ped)

##    animal sire dam genotyped
## 1       1    0   0     FALSE
## 2       2    0   0     FALSE
## 3       3    0   0     FALSE
## 4       4    0   0     FALSE
## 5       5    0   0     FALSE
## 6       6    0   0     FALSE
## 7       7    0   0     FALSE
## 8       8    0   0     FALSE
## 9       9    1   2      TRUE
## 10     10    3   4      TRUE
## 11     11    5   6      TRUE
## 12     12    7   8      TRUE
## 13     13    9  10     FALSE
## 14     14   11  12     FALSE
## 15     15   11   4     FALSE
## 16     16   13  15     FALSE
## 17     17   13  14     FALSE

You will see later why we need to column for genotyped or not.

# order pedigree by genotyped or not
ped.ord <- ped[order(ped$genotyped), ]
  
# create A function
A <- Matrix(createA(ped))

##   Time it took (sec)            :0.000110

# row and column names
rownames(A) <- ped$animal
colnames(A) <- ped$animal
  
# invert A
A_inv <- solve(A)
  
# row and column names
rownames(A_inv) <- ped$animal
colnames(A_inv) <- ped$animal
  
# print A
print(A)

## 17 x 17 sparse Matrix of class "dsCMatrix"

##    [[ suppressing 17 column names '1', '2', '3' ... ]]

##                                                                                                    
## 1  1.000 .     .     .     .     .     .     .     0.50 .     .    .    0.2500 .     .      0.12500
## 2  .     1.000 .     .     .     .     .     .     0.50 .     .    .    0.2500 .     .      0.12500
## 3  .     .     1.000 .     .     .     .     .     .    0.500 .    .    0.2500 .     .      0.12500
## 4  .     .     .     1.000 .     .     .     .     .    0.500 .    .    0.2500 .     0.5000 0.37500
## 5  .     .     .     .     1.000 .     .     .     .    .     0.50 .    .      0.250 0.2500 0.12500
## 6  .     .     .     .     .     1.000 .     .     .    .     0.50 .    .      0.250 0.2500 0.12500
## 7  .     .     .     .     .     .     1.000 .     .    .     .    0.50 .      0.250 .      .      
## 8  .     .     .     .     .     .     .     1.000 .    .     .    0.50 .      0.250 .      .      
## 9  0.500 0.500 .     .     .     .     .     .     1.00 .     .    .    0.5000 .     .      0.25000
## 10 .     .     0.500 0.500 .     .     .     .     .    1.000 .    .    0.5000 .     0.2500 0.37500
## 11 .     .     .     .     0.500 0.500 .     .     .    .     1.00 .    .      0.500 0.5000 0.25000
## 12 .     .     .     .     .     .     0.500 0.500 .    .     .    1.00 .      0.500 .      .      
## 13 0.250 0.250 0.250 0.250 .     .     .     .     0.50 0.500 .    .    1.0000 .     0.1250 0.56250
## 14 .     .     .     .     0.250 0.250 0.250 0.250 .    .     0.50 0.50 .      1.000 0.2500 0.12500
## 15 .     .     .     0.500 0.250 0.250 .     .     .    0.250 0.50 .    0.1250 0.250 1.0000 0.56250
## 16 0.125 0.125 0.125 0.375 0.125 0.125 .     .     0.25 0.375 0.25 .    0.5625 0.125 0.5625 1.06250
## 17 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.25 0.250 0.25 0.25 0.5000 0.500 0.1875 0.34375
##           
## 1  0.12500
## 2  0.12500
## 3  0.12500
## 4  0.12500
## 5  0.12500
## 6  0.12500
## 7  0.12500
## 8  0.12500
## 9  0.25000
## 10 0.25000
## 11 0.25000
## 12 0.25000
## 13 0.50000
## 14 0.50000
## 15 0.18750
## 16 0.34375
## 17 1.00000

# print A inverse
print(A_inv)

## 17 x 17 sparse Matrix of class "dgCMatrix"

##    [[ suppressing 17 column names '1', '2', '3' ... ]]

##                                                                                    
## 1   1.5  0.5  .    .    .    .    .    .   -1.0  .    .    .    .    .    .    .  .
## 2   0.5  1.5  .    .    .    .    .    .   -1.0  .    .    .    .    .    .    .  .
## 3   .    .    1.5  0.5  .    .    .    .    .   -1.0  .    .    .    .    .    .  .
## 4   .    .    0.5  2.0  .    .    .    .    .   -1.0  0.5  .    .    .   -1.0  .  .
## 5   .    .    .    .    1.5  0.5  .    .    .    .   -1.0  .    .    .    .    .  .
## 6   .    .    .    .    0.5  1.5  .    .    .    .   -1.0  .    .    .    .    .  .
## 7   .    .    .    .    .    .    1.5  0.5  .    .    .   -1.0  .    .    .    .  .
## 8   .    .    .    .    .    .    0.5  1.5  .    .    .   -1.0  .    .    .    .  .
## 9  -1.0 -1.0  .    .    .    .    .    .    2.5  0.5  .    .   -1.0  .    .    .  .
## 10  .    .   -1.0 -1.0  .    .    .    .    0.5  2.5  .    .   -1.0  .    .    .  .
## 11  .    .    .    0.5 -1.0 -1.0  .    .    .    .    3.0  0.5  .   -1.0 -1.0  .  .
## 12  .    .    .    .    .    .   -1.0 -1.0  .    .    0.5  2.5  .   -1.0  .    .  .
## 13  .    .    .    .    .    .    .    .   -1.0 -1.0  .    .    3.0  0.5  0.5 -1 -1
## 14  .    .    .    .    .    .    .    .    .    .   -1.0 -1.0  0.5  2.5  .    . -1
## 15  .    .    .   -1.0  .    .    .    .    .    .   -1.0  .    0.5  .    2.5 -1  .
## 16  .    .    .    .    .    .    .    .    .    .    .    .   -1.0  .   -1.0  2  .
## 17  .    .    .    .    .    .    .    .    .    .    .    .   -1.0 -1.0  .    .  2

Now we need to allocate the different subsections of the A matrix.

# subset A matrix into parts for non-genotyped (1) and genotyped (2)
A_11 <- A[ped$genotyped != 1, ped$genotyped != 1]
A_22 <- A[ped$genotyped == 1, ped$genotyped == 1]
A_12 <- A[ped$genotyped != 1, ped$genotyped == 1]
A_21 <- t(A_12)

Notice \(A_{22}\) comes up a lot in single-step papers.

Now we need to sort the A matrix by genotyped or not

# sort A by non-genotyped and genotyped
A_shuff <- A[order(ped$genotyped), order(ped$genotyped)]

Now enter the G matrix.

# G matrix
G <- Matrix(matrix(c(1, .7, .7, .7,
              .7, 1, .7, .7,
              .7, .7, 1, .7,
              .7, .7, .7, 1), nrow=4))
  
# row and column names of G
rownames(G) <- ped$animal[ped$genotyped==1]
colnames(G) <- ped$animal[ped$genotyped==1]
  
print(G)

## 4 x 4 Matrix of class "dsyMatrix"
##      9  10  11  12
## 9  1.0 0.7 0.7 0.7
## 10 0.7 1.0 0.7 0.7
## 11 0.7 0.7 1.0 0.7
## 12 0.7 0.7 0.7 1.0

Finally we can integrate in the G matrix to calculate H.

# create H components (4 quadrants)
H_11 <- Matrix(A_11 + (A_12 %*% solve(A_22) %*% (G - A_22) %*% solve(A_22) %*% A_21))
H_12 <- Matrix(A_12 %*% solve(A_22) %*% G)
H_21 <- Matrix(t(H_12))
H_22 <- G
  
# put H together by peices
H_top    <- cBind(H_11, H_12)
H_bottom <- cBind(H_21, H_22)
H        <- rBind(H_top, H_bottom)
  
# reorder H
H <- H[order(as.numeric(rownames(H))), order(as.numeric(colnames(H)))]

# convert to Matrix
H <- Matrix(H)

# H in paper (rounded to 2 places)
print(round(H, 3))

## 17 x 17 Matrix of class "dgeMatrix"
##        1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16
## 1  1.000 0.000 0.175 0.175 0.175 0.175 0.175 0.175 0.500 0.350 0.350 0.350 0.425 0.350 0.262 0.344
## 2  0.000 1.000 0.175 0.175 0.175 0.175 0.175 0.175 0.500 0.350 0.350 0.350 0.425 0.350 0.262 0.344
## 3  0.175 0.175 1.000 0.000 0.175 0.175 0.175 0.175 0.350 0.500 0.350 0.350 0.425 0.350 0.175 0.300
## 4  0.175 0.175 0.000 1.000 0.175 0.175 0.175 0.175 0.350 0.500 0.350 0.350 0.425 0.350 0.675 0.550
## 5  0.175 0.175 0.175 0.175 1.000 0.000 0.175 0.175 0.350 0.350 0.500 0.350 0.350 0.425 0.338 0.344
## 6  0.175 0.175 0.175 0.175 0.000 1.000 0.175 0.175 0.350 0.350 0.500 0.350 0.350 0.425 0.338 0.344
## 7  0.175 0.175 0.175 0.175 0.175 0.175 1.000 0.000 0.350 0.350 0.350 0.500 0.350 0.425 0.262 0.306
## 8  0.175 0.175 0.175 0.175 0.175 0.175 0.000 1.000 0.350 0.350 0.350 0.500 0.350 0.425 0.262 0.306
## 9  0.500 0.500 0.350 0.350 0.350 0.350 0.350 0.350 1.000 0.700 0.700 0.700 0.850 0.700 0.525 0.688
## 10 0.350 0.350 0.500 0.500 0.350 0.350 0.350 0.350 0.700 1.000 0.700 0.700 0.850 0.700 0.600 0.725
## 11 0.350 0.350 0.350 0.350 0.500 0.500 0.350 0.350 0.700 0.700 1.000 0.700 0.700 0.850 0.675 0.688
## 12 0.350 0.350 0.350 0.350 0.350 0.350 0.500 0.500 0.700 0.700 0.700 1.000 0.700 0.850 0.525 0.612
## 13 0.425 0.425 0.425 0.425 0.350 0.350 0.350 0.350 0.850 0.850 0.700 0.700 1.350 0.700 0.562 0.956
## 14 0.350 0.350 0.350 0.350 0.425 0.425 0.425 0.425 0.700 0.700 0.850 0.850 0.700 1.350 0.600 0.650
## 15 0.262 0.262 0.175 0.675 0.338 0.338 0.262 0.262 0.525 0.600 0.675 0.525 0.562 0.600 1.175 0.869
## 16 0.344 0.344 0.300 0.550 0.344 0.344 0.306 0.306 0.688 0.725 0.688 0.612 0.956 0.650 0.869 1.413
## 17 0.387 0.387 0.387 0.387 0.387 0.387 0.387 0.387 0.775 0.775 0.775 0.775 1.025 1.025 0.581 0.803
##       17
## 1  0.387
## 2  0.387
## 3  0.387
## 4  0.387
## 5  0.387
## 6  0.387
## 7  0.387
## 8  0.387
## 9  0.775
## 10 0.775
## 11 0.775
## 12 0.775
## 13 1.025
## 14 1.025
## 15 0.581
## 16 0.803
## 17 1.525

Notice however the difference between the A and H inverses.

# inverse H
H_inv <- Matrix(round(solve(H),3))

print(H_inv)

## 17 x 17 sparse Matrix of class "dsCMatrix"
##                                                                                               
##  [1,]  1.5  0.5  .    .    .    .    .    .   -1.000  .      .      .      .    .    .    .  .
##  [2,]  0.5  1.5  .    .    .    .    .    .   -1.000  .      .      .      .    .    .    .  .
##  [3,]  .    .    1.5  0.5  .    .    .    .    .     -1.000  .      .      .    .    .    .  .
##  [4,]  .    .    0.5  2.0  .    .    .    .    .     -1.000  0.500  .      .    .   -1.0  .  .
##  [5,]  .    .    .    .    1.5  0.5  .    .    .      .     -1.000  .      .    .    .    .  .
##  [6,]  .    .    .    .    0.5  1.5  .    .    .      .     -1.000  .      .    .    .    .  .
##  [7,]  .    .    .    .    .    .    1.5  0.5  .      .      .     -1.000  .    .    .    .  .
##  [8,]  .    .    .    .    .    .    0.5  1.5  .      .      .     -1.000  .    .    .    .  .
##  [9,] -1.0 -1.0  .    .    .    .    .    .    4.081 -0.253 -0.753 -0.753 -1.0  .    .    .  .
## [10,]  .    .   -1.0 -1.0  .    .    .    .   -0.253  4.081 -0.753 -0.753 -1.0  .    .    .  .
## [11,]  .    .    .    0.5 -1.0 -1.0  .    .   -0.753 -0.753  4.581 -0.253  .   -1.0 -1.0  .  .
## [12,]  .    .    .    .    .    .   -1.0 -1.0 -0.753 -0.753 -0.253  4.081  .   -1.0  .    .  .
## [13,]  .    .    .    .    .    .    .    .   -1.000 -1.000  .      .      3.0  0.5  0.5 -1 -1
## [14,]  .    .    .    .    .    .    .    .    .      .     -1.000 -1.000  0.5  2.5  .    . -1
## [15,]  .    .    .   -1.0  .    .    .    .    .      .     -1.000  .      0.5  .    2.5 -1  .
## [16,]  .    .    .    .    .    .    .    .    .      .      .      .     -1.0  .   -1.0  2  .
## [17,]  .    .    .    .    .    .    .    .    .      .      .      .     -1.0 -1.0  .    .  2

Notice the sparce nature of the H inverse matrix.

Next, look at the difference between the two inverses.

# differnce between H and A inverse
diff <- Matrix(round(H_inv - A_inv, 2))
diff <- as(diff, "sparseMatrix")

# differnce between H and A inverse
print(round(diff, 2))

## 17 x 17 sparse Matrix of class "dgCMatrix"

##    [[ suppressing 17 column names '1', '2', '3' ... ]]

##                                                     
## 1  0 0 . . . . . .  0.00  .     .     .    . . . . .
## 2  0 0 . . . . . .  0.00  .     .     .    . . . . .
## 3  . . 0 0 . . . .  .     0.00  .     .    . . . . .
## 4  . . 0 0 . . . .  .     0.00  0.00  .    . . 0 . .
## 5  . . . . 0 0 . .  .     .     0.00  .    . . . . .
## 6  . . . . 0 0 . .  .     .     0.00  .    . . . . .
## 7  . . . . . . 0 0  .     .     .     0.00 . . . . .
## 8  . . . . . . 0 0  .     .     .     0.00 . . . . .
## 9  0 0 . . . . . .  1.58 -0.75 -0.75 -0.75 0 . . . .
## 10 . . 0 0 . . . . -0.75  1.58 -0.75 -0.75 0 . . . .
## 11 . . . 0 0 0 . . -0.75 -0.75  1.58 -0.75 . 0 0 . .
## 12 . . . . . . 0 0 -0.75 -0.75 -0.75  1.58 . 0 . . .
## 13 . . . . . . . .  0.00  0.00  .     .    0 0 0 0 0
## 14 . . . . . . . .  .     .     0.00  0.00 0 0 . . 0
## 15 . . . 0 . . . .  .     .     0.00  .    0 . 0 0 .
## 16 . . . . . . . .  .     .     .     .    0 . 0 0 .
## 17 . . . . . . . .  .     .     .     .    0 0 . . 0

Notice the differnce between A inverse is sparce, H inverse would be pretty sparce depending on the proportion of genotyped to non-genotyped animals.

This is where Aguilar’s (2010) equation comes from.

\[ \textbf{H}^{-1} = \textbf{A}^{-1} + \left [ \begin{array}{cc} 0 & 0 \\ 0 & \textbf{G}^{-1}-\textbf{A}_{22}^{-1} \end{array} \right ] \]

If you simply subtract A inverse from H inverse you get the result above.

For a final version of the H inverse:

\[ \textbf{H}^{-1} = \textbf{A}^{-1} + \left [ \begin{array}{cc} 0 & 0 \\ 0 & \tau(\alpha \textbf{G} + \beta \textbf{A}_{22})^{-1} - \omega \textbf{A}_{22}^{-1} \end{array} \right ] \]

First you can weight a new G matrix using \(A_{22}\), typically to add a polygenic component or make it more computationally stable (improve the condition). Then one can alter the weights of the G and \(A_{22}\) inverses. Typical values are 0.95, 0.05, 1, and 1 for \(\alpha\), \(\beta\), \(\tau\), and \(\omega\), respectively. Although Lourenco et al. (2014) used values of 0.95, 0.05, 1, and 0.70.

Compare results to Legarra’s 2009 paper

You can view each matrix in the following example from Legarra et al. (2009).

This is the A matrix.

The following is the WRONG H matrix after simply plugging in G. This will correct the future relationships among non-genotyped offspring and relatives, however, it’s not correct. Legarra calls these inconsistencies in the pedigree relationships. If the individuals genotyped have a relationsip of 0.70 then there is no way that the ancestors are unrelated. The correct H is needed to correct for this.

Finally, you have the correct H matrix that adjusts the pedigree to make the entire matrix make sense. So it not only tracks forward through the pedigree, but ‘fixes’ past relationships.

Care must be taken when removing genotypes for this reason. Although theoretically it should increase the accuracy of non-genotyped inviduals, it has shown very marginal improvement in practice.

MME for single-step GBLUP

The mixed model equations are the same, except for we replace A or G with H. This matrix will be of equal dimentions as A, but typically much larger than G.

\[ \left [ \begin{array}{cc} \boldsymbol{X'X} & \boldsymbol{X'Z} \\ \boldsymbol{Z'X} & \boldsymbol{Z'Z} + \boldsymbol{H}^{-1} \lambda \\ \end{array} \right ] \left [ \begin{array}{c} \boldsymbol{\beta} \\ \boldsymbol{u} \end{array} \right ] = \left [ \begin{array}{c} \boldsymbol{X'y} \\ \boldsymbol{Z'y} \end{array} \right ] \]

Issues with single-step

Convergence is poor typically for BLUP and convergence in AIREML or REML algorithms, especially with imputed genotypes (personal experience).
DGV’s from backsolving can end in biased estimates (not on same scale). Not likely in the swine industry, but possible in other species.
May give biased genetic trends (see Vitezica, 2011). They derived a random effect to go into the MME that was comparable to Fernando and Garrick’s J matrix that is added to the X matrix. Genotyped individuals may be a highly selected subset.
Once n is greater than m (number of animals exceeds the number of markers), the matrix G becomes non-positive definite and cannot be inverted for the mixed model equations. We can (1) impute to a higher marker density or (2) one can simply blend A and G as is already done in ssGBLUP. Although Bayesians don’t seem to like this approach.
Computationally, inverting a dense G matrix will be difficult even with powerful computers. It can be a rate limiting step depending on the computing available. Most commercial companies have plenty of computing power, but for large breeding programs the number genotypes has quickly risen to a large number (over a million in dairy). In ‘big O notation’, the complexity for inversion is O(\(n^3\)) and at best O(\(n^2.373\)) with optimization.
It’s common for Bayesians to say that GBLUP cannot handle other genetic architectures (large QTL), however as presented above, this is not quite true. It may not do quite as good of a job as some Bayesian methods however. One limitation to a weighted G matrix is that it’s trait specific so using it in multivariate BLUP situations is still limited.

Solutions for some issues

Convergence has been worked on by Masuda et al. (2017). The PCG algorithm had to be updated to work with ssGBLUP.
One issue was not accounting for inbreeding. This created issues trying to match A and G. BLUPF90 now has capabilities to deal with inbreeding.
As mentioned above, we can weight the G matrix in single-trait analyses to account for non-normal genetic architecture.
Computational issues are still kind of an issue, but as always in research, they are under development still. The APY algorithm has been developed to create an inverse of G from a ‘core’ set and then the rest of the inverse is created at a linear cost. Accuracies from paper’s testing this algorithm have shown that the GEBVs are very close to 1:1 to the full H matrix GEBVs.

Response from Bayesians

Rohan Fernando, Jack Dekkers, and Dorian Garrick (2014) came up with a way to explicitly impute genotype genotypes using a similar process that was used to derive ssGBLUP. Single-step GBLUP does this imputation implicitly and instead imputes covariances between relatives.

Software to handle ssGBLUP and GBLUP

BLUPF90 was designed by Ignacy Misztal with major help from Shogo Tsuruta and Ignacio Aguilar (genomics programs). BLUPF90 is extremely efficient and designed specifically for animal breeding applications. GBLUP can be accomplished by adding a dummy pedigree with 0’s as parents. Very easy to learn and implement standard models.
ASReml is a nice software for more general statistics. It can be used to apply many complex models. The learning curve is quite steep and it is not designed specifically for animal breeding applications, although it can handle pedigree analysis easily. GBLUP is fairly easy to attain if you provide G directly in a lower triagular matrix and renumber all animals 1,2,…,n to coorespond after subsetting to only genotyped animals in the dataset. This is time consuming and not good for routine evaluations.
DMU is focused on animal breeding and does have full single-step GBLUP capability I believe. I don’t know much about this software. They have a standalone pedigree software DMUtrace.
MiXBLUP is being implemented in some of the Canadian swine industry. It has full single-step capabilities I believe.
BOLT works differently, but implements a single-step Bayesian approach using the same theory used to derive the H matrix that explicetely. There is no actual website for BOLT at this time (to my knowledge). Many of the same tools are implemented in Julia from Hao Cheng (now at UC Davis) called JWAS

A work by Austin Putz

putz.austin@gmail.com

References

Forni, S., I. Aguilar, and I. Misztal. 2011. “Different Genoimc Relationship Matrices for Single-Step Analysis Using Phenotypic, Pedigree and Genomic Information.” Genet. Sel. Evol. 43 (1).

Lourenco, D. A. L., B. O. Fragomeni, S. Tsuruta, I. Aguilar, B. Zumbach, R. J. Hawken, A. Legarra, and I. Misztal. 2015. “Accuracy of Estiamted Breeding Values with Genomic Infomraiton on Males, Females, or Both: An Example on Broiler Chicken.” Genet. Sel. Evol. 47 (56).

Lourenco, D. A. L., I. Misztal, S. Tsuruta, T. J. Lawlor, S. Forni, and J. I. Weller. 2014. “Are Evaluations on Young Genotyped Animlas Benefiting from the Past Generations.” J. Dairy. Sci. 97: 3930–42.

VanRaden, P. M. 2008. “Efficient Methods to Compute Genomic Predictions.” J. Dairy. Sci. 91: 4414–23.

AnS 562A - Introduction to GBLUP and single-step GBLUP

Austin Putz

4/10/2018

Introduction to GBLUP

Model

MME

GBLUP

Example

Processing the marker matrix (M)

Create the G matrix

Condition of the matrix

Other Matrices

Alternative calculations for G

Using different allele frequencies

Using individual SNP weights (simple)

Weighted G Matrix

Single-step GBLUP

Introduction

Example

Compare results to Legarra’s 2009 paper

MME for single-step GBLUP

Issues with single-step

Solutions for some issues

Response from Bayesians

Software to handle ssGBLUP and GBLUP

References