Required packages

require(gaston)
require(EMMREML)
require(tictoc)

Introduction

Here, I will explain several statistical models used in GWAS, performing actual calculations in R, based on the chapter on GWAS (https ://link.springer.com/content/pdf/10.1007/978-3-030-83940-6_19) in the excellent textbook Quantitative Genetics (Shizhong Xu 2022, Springer, https://doi.org/10.1007/978-3-030-83940-6).

Data preparation

Loading data

First, marker genotype data in vcf format (McCouch et al. 2016) are loaded.

bm <- read.vcf("HDRA-G6-4-RDP1-RDP2-NIAS.AGCT_MAF005MIS001NR_imputed.vcf.gz")

## ped stats and snps stats have been set. 
## 'p' has been set. 
## 'mu' and 'sigma' have been set.

bm # show the summary of the file

## A bed.matrix with 1568 individuals and 38769 markers.
## snps stats are set
## ped stats are set

bm contains SNP genotypes coded as scores, 0, 1, and 2.

as.matrix(bm[1:10, 1:4])

##                       SNP-1.8562. SNP-1.18594. SNP-1.24921. SNP-1.25253.
## IRGC121316@c88dcbba.0           0            0            0            0
## IRGC121578@950ba1f2.0           0            0            0            0
## IRGC121491@0b7d031b.0           0            0            0            2
## IRGC121440@d02dfe4e.0           0            0            0            2
## IRGC121896@89688646.0           0            0            0            0
## IRGC121323@57c71a3b.0           0            0            0            0
## IRGC121251@392016d6.0           0            0            0            0
## IRGC121902@f1be6ab1.0           0            0            0            1
## IRGC124483@9a9187e6.0           0            0            0            0
## IRGC121314@6c2404e4.0           0            0            0            0

Read in the attribute information and phenotypic data

Next, read in the phenotypic data obtained for the same genetic resource accession.

# read phenotypic data
pheno <- read.csv("RiceDiversityPheno4GWASGS.csv", row.names = 1)

Here, seed length is used as an example trait.

y <- pheno[, "Seed.length"] # select seed length as the target trait
names(y) <- rownames(pheno) # naming the elements of y
y <- na.omit(y) # remove missing elements

Extract marker genotype data for analysis

Since phenotypic data are not available for all lines, SNP data for lines with phenotypic data are extracted from bm. Here, the names given to y are used.

# select samples with ids contained in rownames(pheno)
bm.wp <- bm[bm@ped$id %in% names(y),]

Remove monomorphic markers

bm.wp <- select.snps(bm.wp, maf > 0.01)
bm.wp

## A bed.matrix with 361 individuals and 38752 markers.
## snps stats are set
## ped stats are set

Finally, the phenotype data $y$ are sorted in the same order as the SNP data bm.wp.

y <- y[bm.wp@ped$id] # reorder the elements of y as same as the bm@ped$id

Various types of regression analysis

Simple regression analysis

The model for a single regression analysis can be written as \[ y = X\beta + Z_k \gamma_k + e \] where $X$ is the design matrix of fixed effects (year, study site, etc.) including the overall mean, $\beta$ is the fixed effect, $Z_k$ is the scored genotype (-1, 0, 1) of the $k$th marker, $\gamma_k$ is the effect of marker $k$, $e$ is the error term including environmental errors and other SNPs’ effects.

Here we create a design matrix for the over-all mean.

n <- length(y)
X <- matrix(1, nrow = n, ncol = 1)
head(X)

##      [,1]
## [1,]    1
## [2,]    1
## [3,]    1
## [4,]    1
## [5,]    1
## [6,]    1

Note that the $j$th element of $Z_k$ represents the genotype of the $k$th marker of the $j$th genotype (accession, line, variety etc), as mentioned earlier. \[ Z_{jk} = \left\{ \begin{array}{ll} -1 & \rm{for}\, A_1 A_1 \\ 0 & \rm{for}\, A_1 A_2 \\ 1 & \rm{for}\, A_2 A_2 \end{array} \right. \]

In bm.wp, it is scored at 0, 1, 2, so subtract 1 from the score to -1, 0, 1.

Z <- as.matrix(bm.wp) - 1
head(Z[,1:5])

##                     SNP-1.8562. SNP-1.18594. SNP-1.24921. SNP-1.25253.
## NSFTV80@02d095ba.0           -1           -1           -1           -1
## NSFTV87@64fa0112.0           -1           -1           -1           -1
## NSFTV96@32a6808e.0           -1           -1           -1           -1
## NSFTV100@1f10be3d.0          -1           -1           -1           -1
## NSFTV114@eac19fb8.0          -1           -1           -1           -1
## NSFTV140@85551f9c.0          -1           -1           -1           -1
##                     SNP-1.29213.
## NSFTV80@02d095ba.0            -1
## NSFTV87@64fa0112.0            -1
## NSFTV96@32a6808e.0            -1
## NSFTV100@1f10be3d.0           -1
## NSFTV114@eac19fb8.0           -1
## NSFTV140@85551f9c.0           -1

Estimates of the regression coefficients $\beta$ and $\gamma_k$ are obtained by the following matrix calculation. \[ \begin{bmatrix} \hat{\beta} \\ \hat{\gamma_k} \end{bmatrix} = \begin{bmatrix} X^TX & X^TZ_k \\ Z_k^TX & Z_k^TZ_k \end{bmatrix}^{-1} \begin{bmatrix} X^Ty \\ Z_k^T y \end{bmatrix} \]

Let’s estimate the regression coefficients for the 1000th SNP (k = 1000).

k <- 1000
Zk <- Z[, k]

Perform a single regression analysis with R lm function.

summary(lm(y ~ Zk))

## 
## Call:
## lm(formula = y ~ Zk)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2144 -0.6674 -0.1401  0.6565  4.2692 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.59912    0.05932 144.963  < 2e-16 ***
## Zk           0.31948    0.05957   5.363 1.47e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9148 on 359 degrees of freedom
## Multiple R-squared:  0.07418,    Adjusted R-squared:  0.0716 
## F-statistic: 28.76 on 1 and 359 DF,  p-value: 1.467e-07

Let’s try to find it by matrix calculation. First, we prepare the matrix C to compute the inverse of C.

xtx <- t(X) %*% X
xtz <- t(X) %*% Zk
ztx <- t(Zk) %*% X
ztz <- t(Zk) %*% Zk

C <- matrix(c(xtx, xtz, ztx, ztz), nrow = 2, ncol = 2, byrow = T)
C

##      [,1] [,2]
## [1,]  361 -210
## [2,] -210  358

Calculate the inverse of C matrix.

Cinv <- solve(C)
Cinv

##             [,1]        [,2]
## [1,] 0.004204938 0.002466584
## [2,] 0.002466584 0.004240175

Compute estimates of the regression coefficients using the inverse matrix.

xty <- t(X) %*% y
zty <- t(Zk) %*% y

theta <- Cinv %*% matrix(c(xty, zty), nrow = 2, ncol = 1)
theta

##          [,1]
## [1,] 8.599119
## [2,] 0.319478

The error variance $\sigma^2$ can be estimated as \[ \hat{\sigma^2}=\frac{1}{n-q-1}(y-X\hat{\beta}-Z_k\hat{\gamma}_k)^T(y-X\hat{\beta}-Z_k\hat{\gamma}_k) \]

To easily calculate $X\hat{\beta}-Z_k\hat{\gamma}_k$, $X$ and $Z_k$ are bundled together as $T_k$. Then $X\hat{\beta}-Z_k\hat{\gamma}_k$ can be calculated as $T_k \hat{\theta}$.

Tk <- cbind(X, Zk)

Now let’s calculate the error variance.

s2 <- (1 / (n - ncol(X) -  1)) * 
  t(y - Tk %*% theta) %*% (y - Tk %*% theta)
s2 <- drop(s2); s2

## [1] 0.8368293

The variance of the estimated regression coefficients can be determined based on this estimate of the error variance. \[ \rm{var} \begin{bmatrix} \hat{\beta} \\ \hat{\gamma_k} \end{bmatrix} = \begin{bmatrix} \rm{var}(\hat{\beta}) & \rm{cov}(\hat{\beta}, \hat{\gamma}_k) \\ \rm{cov}(\hat{\gamma}_k, \hat{\beta}) & \rm{var}(\hat{\gamma}_k) \end{bmatrix} = \begin{bmatrix} X^TX & X^TZ_k \\ Z_k^TX & Z_k^TZ_k \end{bmatrix}^{-1} \sigma^2 = \begin{bmatrix} C_{XX} & C_{XZ} \\ C_{ZX} & C_{ZZ} \end{bmatrix} \sigma^2 \]

The actual calculation is

v_theta = Cinv * s2
v_theta

##             [,1]        [,2]
## [1,] 0.003518815 0.002064109
## [2,] 0.002064109 0.003548302

A test for whether the regression coefficient $\gamma_k$ is zero can be performed using the following statistic \[ W_k = \frac{\hat{\gamma}_k^2}{\rm{var}(\hat{\gamma_k})} \] This is called the Wald test.

Let’s actually calculate the statistic.

gammak <- theta[2]
v_gammak <- v_theta[2,2]
Wk <- gammak^2 / v_gammak
Wk

## [1] 28.7648

This statistic follows a chi-square distribution with one degree of freedom under the null hypothesis. Thus, $p$-value can be calculated as \[ p_k = 1 - \rm{Pr}(\chi_1^2 < Wk) \]

Let’s calculate it.

pk <- 1 - pchisq(Wk, df = 1)
pk

## [1] 8.172313e-08

Note that in GWAS, the exponent ($-\log10(p)$) is often used by taking the ordinary logarithm of the $p$ value and multiplying it by $-1$. \[ \eta_k = -\log_{10}(p_k) \]

The actual calculation is

-log10(pk)

## [1] 7.087655

More precisely, $W_k$ follows an $F$ distribution with degrees of freedom $1$, $n - q - 1$, where $q$ denotes the number of markers, under the null hypothesis.

pk <- 1 - pf(Wk, df1 = 1, df2 = n - ncol(X) - 1)
pk

## [1] 1.46714e-07

-log10(pk)

## [1] 6.833529

This result can be obtained as a result of an analysis of variance ($F$ test is performed) for a single regression model using $lm$.

model <- lm(y ~ Zk)
anova(model)

## Analysis of Variance Table
## 
## Response: y
##            Df  Sum Sq Mean Sq F value    Pr(>F)    
## Zk          1  24.071 24.0712  28.765 1.467e-07 ***
## Residuals 359 300.422  0.8368                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Let’s run a simple regression analysis on all markers using the association.test function in gaston.

gwas.res <- association.test(bm.wp, Y = y, method = "lm", response = "quantitative", test = "wald")

The results of the analysis of the 1000th marker are

gwas.res[1000, ]

##      chr     pos             id A1 A2    freqA2     beta         sd           p
## 1000   1 6829864 SNP-1.6828863.  C  T 0.2091413 0.319478 0.05956763 1.46714e-07

sd is the standard deviation of the estimated $\hat{\gamma_k}$ (i.e., standard error). This can be computed as

sqrt(v_theta[2,2])

## [1] 0.05956763

The heritability calculated by this marker effect alone is, \[ h_k^2 = \frac{\sigma^2_{G_k}}{\sigma^2_{G_k}+\sigma^2} = \frac{\rm{var}(Z_k)\gamma_k^2}{\rm{var}(Z_k)\gamma_k^2+\sigma^2} = \frac{\sigma_{Z_k}^2\gamma_k^2}{\sigma_{Z_k}^2\gamma_k^2+\sigma^2} \] where $\rm{var}(Z_k) = \sigma_{Z_k}^2$ and $\sigma^2_{G_k} = \sigma_{Z_k}^2\gamma_k^2$.

Let’s do the actual calculations.

sgk2 <- var(Zk) * gammak^2
hk2 <- sgk2 / (sgk2 + s2); hk2

## [1] 0.07399023

Mixed linear model based on relatedness inferred from marker genotypes

This model, called the Q+K model, is one of the most commonly used GWAS models for quantitative traits. The model equation is as follows. \[ y = Q\alpha + X\beta + Zu + e \] Here, $Q$ consists of a numerical value that expresses the population structure. For example, the probability of belonging to a subpopulation as estimated by Structure analysis (Pritchard et al. 2000) is used as $Q$. The $\alpha$ represents the effect of each subpopulation on the phenotype and is estimated as a regression coefficient. $X$ represents the marker genotype of interest and $\beta$ is its effect. The $u$ represents the polygene effect distributed across the genome, and $Z$ is the design matrix to link the effect to each sample. If each sample is measured once, $Z$ becomes the unit matrix. The interpretation of $E$ is the same as before.

Under this model, the expected value of $Y$ is, \[ E(y) = Q\hat{\alpha} + X\hat{\beta} \] and the variance is, \[ var(y) = V = ZKZ^T \sigma_u^2 + I\sigma^2 \] \[ W = \frac{\hat{\beta}^2}{var(\hat{\beta})} \]

Efficient Computational Method: EMMA (Efficient Mixed Model Association)

The previous model is rewritten and represented as \[ y = X\beta + Z_k \gamma_k + \xi + \epsilon \] where $X\beta$ is the fixed effect and represents the grand mean or sub-population structure. This is represented as $Q\alpha$ in the previous model. $Z_k \gamma_k$ is the effect of k-th marker, which is the marker of interest. $Z_k$ represent the status of the genotypes of the marker. $\xi$ is the random effect which is represented as $Zu$ in the previous model. $\epsilon$ is the error term.

We assume the random effect $xi$ (which is a vector with the length of sample number) follows the multi normal distribution with mean $\mathbf0$ (which is a zero vector with the length of sample number) and variance \[ var(\xi) = K \phi^2 \]

This random effect is assumed to be caused by polygenic effects of all markers, thus \[ \xi = \sum_{k=1}^m Z_k \delta_k \] where $Z_k$ is the score representing the genotype of k-th marker. $\delta_k$ is the polygenic effect of the k-th marker and follows a normal distribution \[ \delta_k \sim N(0, \varphi^2) \] Under this assumption, the variance of $\xi$ can be calculated as \[ var(\xi) = \sum_{k=1}^m Z_k var(\delta_k) Z_k^T = \sum_{k=1}^m Z_k Z_k^T \phi^2 = K\phi^2 \] where \[ K = \sum_{k=1}^m Z_k Z_k^T \]

Under the above model, the expected value of $y$ is \[ E(y) = X\beta + Z_k\gamma_k \] and the variance of $y$ is \[ var(y) = K\phi^2 + I\sigma^2 = (K\lambda + I) \sigma^2 \] where \[ \lambda = \phi^2 / \sigma^2 \]

Now, for the variables that are the source of the fixed effects,et us denote them collectively as \[ T_k = (X||Z_k) \] then the above model is represented as \[ y = T_k \theta_k + \xi + \epsilon \] where \[ \theta = (\beta^T, \gamma_k)^T \]

The profile likelihood for $\lambda$ (the likelihood of profiling out parameters other than $lambda$) of the above model is, \[ L(\lambda) = -\frac{1}{2} \ln |H| - \frac{1}{2} \ln | T_k^T H^{-1} T_k | - \frac{n-q-1}{2} \ln (y^T P_k y) + const \] where \[ H = K\lambda + I \\ P_k = H^{-1} - H^{-1} T_k (T_k^T H^{-1} T_k)^{-1} T_k^T H^{-1} \]

Once the $\lambda$ that maximizes the profile likelihood above is found, $\hat{\theta}$, that is, $\hat{\beta}$ and $\hat{\gamma_k}$, are estimated as follows. \[ \begin{bmatrix} \hat{\beta} \\ \hat{\gamma_k} \end{bmatrix} = \begin{bmatrix} X^TH^{-1}X & X^TH^{-1}Z_k \\ Z_k^TH^{-1}X & Z_k^TH^{-1}Z_k \end{bmatrix}^{-1} \begin{bmatrix} X^TH^{-1}y \\ Z_k^TH^{-1}y \end{bmatrix} \] The error variance $\sigma^2$ is estimated as \[ \hat{\sigma}^2 = \frac{1}{n-q-1} (y - X\hat{\beta} - Zk\hat{\gamma}_k)^{T} H^{-1} (y - X\hat{\beta} - Zk\hat{\gamma}_k) \]

The variance of the estimates of $\beta$ and $\gamma_k$ are \[ var \begin{bmatrix} \hat{\beta} \\ \hat{\gamma_k} \end{bmatrix} = \begin{bmatrix} X^TH^{-1}X & X^TH^{-1}Z_k \\ Z_k^TH^{-1}X & Z_k^TH^{-1}Z_k \end{bmatrix}^{-1} \hat{\sigma}^2 = \begin{bmatrix} var(\hat{\beta}) & cov(\hat{\beta}, \hat{\gamma}_k) \\ cov(\hat{\gamma}_k, \hat{\beta}) & var(\hat{\gamma}_k) \end{bmatrix} \]

The test of significance of the marker effect $\hat{\gamma}_k$ is performed using the Wald test based on the estimates and their variances. \[ W_k = \frac{\hat{\gamma}_k^2}{var(\hat{\gamma}_k )} \]

From here, I will explain how to efficiently perform the calculations shown above using the eigenvalue decomposition of $K$: \[ K = UDU^T \]

Using the above equation for $H$, we derive, \[ H = K\lambda + I = UDU^T + I = U(D\lambda + I) U^T \] The determinant of the product of matrices is the product of the determinants of the matrices, i.e., $|AB|=|A|||B|$. Using this, \[ \ln |H| = \ln |U(D\lambda + I)U^T| = \ln |U||D\lambda + I||U^T| \\ = \ln |U||U^T||D\lambda + I| = \ln |UU^T||D\lambda + I| \\ =\ln |D\lambda + I | = \sum_{j=1}^n \ln (d_j\lambda + 1) \]

Using the above relationships, the exponential part of the profile likelihood \[ L(\lambda) = -\frac{1}{2} \ln |H| - \frac{1}{2} \ln | T_k^T H^{-1} T_k | - \frac{n-q-1}{2} \ln (y^T P_k y) + const \] can be derived as \[ y^T(H^{-1} - H^{-1}T_k(T_k^TH^{-1}T_k)^{-1}T_k^TH^{-1})y = \\ y^TH^{-1}y - y^TH^{-1}T_k(T_k^TH^{-1}T_k)^{-1}T_k^TH^{-1}y \] This can be computed by combining $y^TH^{-1}y, y^TH^{-1}T_k, T_k^TH^{-1}T_k$. Here, let \[ y^* = U^Ty\\ T_k^* = U^TT_k \] then \[ H^{-1} = U(D\lambda + I)^{-1} U^T \] so, \[ y^TH^{-1}y = y^T U(D\lambda + I)^{-1} U^T y = y^{*T}(D\lambda + I)^{-1}y^*\\ y^TH^{-1}T_k = y^T U(D\lambda + I)^{-1} U^T T_k = y^{*T}(D\lambda + I)^{-1}T_k^*\\ T_k^TH^{-1}T_k = T_k^T U(D\lambda + I)^{-1} U^T T_k = T_k^{*T}(D\lambda + I)^{-1}T_k^* \]

Note that since $D\lambda + I$ is a diagonal matrix, these can be calculated as follows. \[ a^T H^{-1} b = a^T U(D\lambda + I)^{-1} U^Tb=a^{*T}(D\lambda+I)^{-1}b^{*} = \sum_{j=1}^n a_j^{*T}b_j^{*}(d_j\lambda + 1)^{-1} \]

From here, we will perform some actual calculations based on the equations given above.

The first step is to compute $ZZ^T$ and its eigenvalue decomposition.

K <- tcrossprod(Z)
eigK <- eigen(K)

From the eigenvalue decomposition results, take out $U$ and $D$. $U$ is a matrix consisting of eigenvectors and $D$ is a matrix with eigenvalues on its diagonal.

U <- eigK$vectors
D <- diag(eigK$values)

Next, $X$ and $Z_k$ are combined as $T_k$.

Tk <- cbind(X, Zk)

Let’s assume that $lambda = 0.5$. The computation of $H$ is possible in two ways: using $K$ or using $U, D$.

lambda <- 0.5
I <- diag(1, nrow(K))
H <- K * lambda + I
H[1:3, 1:3]

##                    NSFTV80@02d095ba.0 NSFTV87@64fa0112.0 NSFTV96@32a6808e.0
## NSFTV80@02d095ba.0            19331.0             8434.5             8330.5
## NSFTV87@64fa0112.0             8434.5            19326.0            15771.5
## NSFTV96@32a6808e.0             8330.5            15771.5            19328.5

H2 <- U %*% (D * lambda + I) %*% t(U)
H2[1:3, 1:3]

##         [,1]    [,2]    [,3]
## [1,] 19331.0  8434.5  8330.5
## [2,]  8434.5 19326.0 15771.5
## [3,]  8330.5 15771.5 19328.5

The inverse of $H$ is similar, and can be calculated in two ways: directly by finding the inverse of $H$, or by using $U, D$. Note that the inverse of $D$ will simply be a matrix with the inverse of the eigenvalues as diagonal components.

Hinv <- solve(H)
Hinv[1:3, 1:3]

##                    NSFTV80@02d095ba.0 NSFTV87@64fa0112.0 NSFTV96@32a6808e.0
## NSFTV80@02d095ba.0       1.139665e-04       2.708023e-06       4.262416e-06
## NSFTV87@64fa0112.0       2.708023e-06       4.475645e-04      -5.212194e-05
## NSFTV96@32a6808e.0       4.262416e-06      -5.212194e-05       2.975169e-04

Hinv2 <- U %*% diag(1 / (eigK$values * lambda + 1)) %*% t(U)
Hinv2[1:3, 1:3]

##              [,1]          [,2]          [,3]
## [1,] 1.139665e-04  2.708023e-06  4.262416e-06
## [2,] 2.708023e-06  4.475645e-04 -5.212194e-05
## [3,] 4.262416e-06 -5.212194e-05  2.975169e-04

Map $y$ and $T_k$ to the coordinate system $U$ obtained from the eigenvalue decomposition of $H$.

y2 <- t(U) %*% y
Tk2 <- t(U) %*% Tk

In the space to be mapped, the coordinate system is orthogonal and the inverse can be easily computed.

t(y) %*% Hinv %*% y

##            [,1]
## [1,] 0.07111932

t(y2) %*% diag(1 / (eigK$values * lambda + 1)) %*% y2

##            [,1]
## [1,] 0.07111932

t(y) %*% Hinv %*% Tk

##                           Zk
## [1,] 0.001444281 0.001204933

t(y2) %*% diag(1 / (eigK$values * lambda + 1)) %*% Tk2

##                           Zk
## [1,] 0.001444281 0.001204933

t(Tk) %*% Hinv %*% Tk

##                             Zk
##     1.657128e-04 -9.411137e-05
## Zk -9.411137e-05  3.216068e-02

t(Tk2) %*% diag(1 / (eigK$values * lambda + 1)) %*% Tk2

##                             Zk
##     1.657128e-04 -9.411137e-05
## Zk -9.411137e-05  3.216068e-02

To further streamline the calculation, use crossprod, a native function of R.

w <- 1 / (eigK$values * lambda + 1)
crossprod(y2 * sqrt(w))

##            [,1]
## [1,] 0.07111932

crossprod(y2 * sqrt(w), Tk2 * sqrt(w))

##                           Zk
## [1,] 0.001444281 0.001204933

crossprod(Tk2 * sqrt(w))

##                             Zk
##     1.657128e-04 -9.411137e-05
## Zk -9.411137e-05  3.216068e-02

Based on the results so far, let’s write a function to calculate the log-likelihood (excluding the constant part). The calculation is made on a logarithmic basis so that lambda does not take negative values.

loglike_emma <- function(log.lambda, y2, Tk2, eigK.values) {
  lambda <- exp(log.lambda) # with this trick, lambda is always positive 
  n <- nrow(Tk2)
  q2 <- ncol(Tk2)  
    # The number of columns of X is q. 
    # q2 (= q + 1) is the number of columns of Tk
  w <- 1 / (eigK$values * lambda + 1)
  yw <- y2 * sqrt(w)
  tw <- Tk2 * sqrt(w)
  yHy <- crossprod(yw)
  yHt <- crossprod(yw, tw)
  tHt <- crossprod(tw)
  ln.detH <- sum(log(eigK.values * lambda + 1))
  ll <- - 0.5 * ln.detH - 0.5 * log(det(tHt)) - 
    0.5 * (n - q2) * log(yHy - yHt %*% solve(tHt) %*% t(yHt))
  return(ll)
}

Compute the log likelihood (excluding the constant part) when $\lambda = 0.5$.

loglike_emma(log(0.5), y2, Tk2, eigK$values)

##          [,1]
## [1,] -886.691

Optimize $\lambda$ (log.lambda in the computational procedure) that maximizes the profile likelihood using an optimum function.

parm <- optim(par = log(0.5), fn = loglike_emma, 
              y2 = y2, Tk2 = Tk2, eigK.values = eigK$values,
              method="L-BFGS-B", lower = -100, upper = 100,
              control = list(fnscale = -1))
parm

## $par
## [1] -4.576745
## 
## $value
## [1] -875.7417
## 
## $counts
## function gradient 
##        8        8 
## 
## $convergence
## [1] 0
## 
## $message
## [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"

Convert the optimized log.lambda value to $\lambda$.

opt.lambda <- exp(parm$par)
opt.lambda

## [1] 0.01028833

From here, we obtain estimates of $\beta,\gamma_k$ based on the optimized $\lambda$ values.

w <- 1 / (eigK$values * opt.lambda + 1)
TkHinvTk <- crossprod(Tk2 * sqrt(w))
TkHinvy <- crossprod(Tk2 * sqrt(w), y2 * sqrt(w))
Cinv <- solve(TkHinvTk)
theta <- Cinv %*% TkHinvy
theta

##          [,1]
##    8.77527022
## Zk 0.06106566

Calculate the error variance.

yHinvy <- crossprod(y2 * sqrt(w))
s2 <- (yHinvy - t(TkHinvy) %*% Cinv %*% TkHinvy) / (n - ncol(X) -  1)
s2 <- drop(s2); s2

## [1] 0.006785645

Calculate the estimation error of the parameters.

v_theta = Cinv * s2
v_theta

##                         Zk
##    0.846255782 0.002571338
## Zk 0.002571338 0.004519092

Calculate the Wald statistic and perform a chi-square test.

Wk <- theta[2]^2 / v_theta[2,2]
Wk

## [1] 0.8251691

pk <- 1 - pchisq(Wk, df = 1)
pk

## [1] 0.3636732

-log10(pk)

## [1] 0.4392887

The $F$ test can also be performed.

pk <- 1 - pf(Wk, df1 = 1, df2 = n - ncol(X) - 1)
pk

## [1] 0.3642826

-log10(pk)

## [1] 0.4385616

GWAS using the association.test function in the gaston package.

gwas.res <- association.test(bm.wp, Y = y, method = "lmm", eigenK = eigK,
                             response = "quantitative", test = "wald", p = 0)

The result of the 1000th marker is shown.

gwas.res[1000, ]

##      chr     pos             id A1 A2    freqA2         h2       beta
## 1000   1 6829864 SNP-1.6828863.  C  T 0.2091413 0.01018355 0.06106563
##              sd         p
## 1000 0.06722419 0.3636735

The $P$ value is obtained by a chi-square test.

sd is the standard error (square root of the estimated error variance).

sqrt(v_theta[2,2])

## [1] 0.06722419

Note that \[ \lambda = \phi^2 / \sigma^2 \] When we rewrite it with $\phi^2 = \sigma_g^2, \sigma^2 = \sigma_e^2$, then \[ \lambda = \sigma_g^2 / \sigma_e^2 \] Also, the heritability due to the polygene effect is, \[ h^2 = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2 } \] then \[ h^2 = \frac{\lambda}{\lambda + 1} \]

The actual calculation also agrees with the result of the association.test function of gaston package.

h2 <- opt.lambda / (opt.lambda + 1)
h2

## [1] 0.01018355

Mixed modeling methods based on pedigree information

A similar model was proposed before the Q+K model, as described in Xu’s Quantitative Genetics book. However, Xu also states that it was rarely used because there was no implementation like TASSEL and it was not publised in a highly cited journal (Genetics vs. Nature Genetics). The model was proposed by Zhang et al. (2005) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1449576/

Let’s look at this approach for a moment.

The model is described as follows \[ y = X\beta + Z_k \gamma_k + W\alpha + e \] Note that $\gamma_k$ is assumed to be a normally distributed $N(0, \sigma_k^2)$ variate effect, and thus the expectation and variance of $y$ are as follows. \[ E(y) = X\beta \\ var(y) = V = Z_k Z_k^T \sigma_k^2 + W W^T \sigma_\alpha^2 + I\sigma^2 \] The $Z_k$ and $W$ denote the pedigree composition at marker $k$ and across the genome, respectively.

Let $\Pi_k = Z_k Z_k^T$ and also $\Pi = WW^T$, then the variance is \[ var(y) = V = \Pi_k \sigma_k^2 + \Pi \sigma_\alpha^2 + I\sigma^2 \]

Tests for marker effects are variance-based and are based on the following null hypotheses \[ H_0: \sigma_k^2 = 0 \] The test statistic is \[ \lambda_k = -2 [L(0) - L(\hat{\sigma}_k^2)] \]

DEMMA (Decontaminated Efficient Mixed Model Association)

In the model treated above \[ y = T_k \theta_k + \xi + \epsilon \] the $\xi$ is the polygene effect, and if we use the marker genotype score, which can be described as \[ \xi = Z\delta = \sum_k^q Z_k \delta_k \]

This $\delta_k$ can be calculated as follows. \[ \hat{\delta}_k = E(\delta_k | y) = Z_k\hat{\lambda}(K\hat{\lambda}+I)^{-1}(y - X\hat{\beta}) \]

The $\xi$ is calculated as follows

xi <- (K * opt.lambda) %*% solve(K * opt.lambda + I) %*% (y - Tk %*% theta)
head(xi)

##                           [,1]
## NSFTV80@02d095ba.0   0.3722219
## NSFTV87@64fa0112.0  -0.2300204
## NSFTV96@32a6808e.0  -0.4378191
## NSFTV100@1f10be3d.0 -0.6898523
## NSFTV114@eac19fb8.0 -0.7013808
## NSFTV140@85551f9c.0 -0.5781901

The $\delta$ is calculated as follows

delta <- t(Z * opt.lambda) %*% solve(K * opt.lambda + I) %*% (y - Tk %*% theta)
head(delta)

##                       [,1]
## SNP-1.8562.  -9.692751e-05
## SNP-1.18594.  9.833706e-04
## SNP-1.24921.  5.470886e-04
## SNP-1.25253. -1.285817e-04
## SNP-1.29213.  2.430925e-04
## SNP-1.30477. -6.335946e-04

Let’s perform calculation to see if they actually match.

xi2 <- Z %*% delta
head(xi2)

##                           [,1]
## NSFTV80@02d095ba.0   0.3722219
## NSFTV87@64fa0112.0  -0.2300204
## NSFTV96@32a6808e.0  -0.4378191
## NSFTV100@1f10be3d.0 -0.6898523
## NSFTV114@eac19fb8.0 -0.7013808
## NSFTV140@85551f9c.0 -0.5781901

Check the match of the values with the figure.

plot(xi, xi2)
abline(0, 1, col = "green")

The match was confirmed.

Because $\hat{\xi}_k = Z_k\hat{\delta}_k$, the model can be expressed as \[ y = X\beta + Z_k \gamma_k + \xi - \hat{\xi}_k + \epsilon\\ = X\beta + Z_k \gamma_k + \xi - Z_k\hat{\delta}_k + \epsilon\\ = X\beta + Z_k (\gamma_k - \hat{\delta}_k) + \xi + \epsilon\\ \] The result shows that $\delta_k$ has a counteracting relationship with $\gamma_k$.

To prevent this counteracting relationship and not affect the $\gamma_k$ test, use the relationship \[ \xi = \sum_{k'=1}^m Z_{k'}\delta_k' = Z_k\delta_k + \sum_{k'\ne k}^m Z_k'\delta_k' \] and model the polygene effect by removing the effect of the $k$th marker.

Namely, \[ \xi_{-k} = \sum_{k'\ne k}^m Z_{k'}\delta_{k'} \]

At this time, the model is, \[ y = X\beta + Z_k \gamma_k + \xi_{-k} + \epsilon \] and its expected value is \[ E(y) = X_k\beta_k = X\beta + Z_k\gamma_k \] The variance is \[ V_k = \sum_{k'\ne k}^m Z_{k'}Z_{k'}^T\phi^2+I\sigma^2=K_{-k}\phi^2 + I\sigma^2 \\ \] where \[ K_{-k}=\sum_{k'\ne k}^m Z_{k'}Z_{k'}^T = K - Z_{k}Z_{k}^T \] and thus \[ V_k = K_{-k}\phi^2 + I\sigma^2 = K\phi^2 - Z_k Z_k^T\phi^2+I\sigma^2 = K\phi^2 + I\sigma^2 - Z_k Z_k^T\phi^2 \]

This is represented using $\lambda$ \[ V_k = \left[(K\lambda+I)-Z_kZ_k^T\lambda\right]\sigma^2 = (H - Z_kZ_k^T\lambda)\sigma^2 = H_k\sigma^2 \] where $\lambda = \phi^2 / \sigma^2$ and $H_k = H - Z_kZ_k^T\lambda$.

The $H_k$ inverse and determinant needed to compute the estimates can be computed using the $H$ inverse and determinant as follows. \[ H_k^{-1}=H^{-1}-H^{-1}Z_k\lambda(Z_k^TH^{-1}Z_k\lambda-I)^{-1}Z_k^TH^{-1}\\ |H_k| = |H||Z_k^TH^{-1}Z_k-\lambda^{-1}||-\lambda|=|H||I-Z_k^TH^{-1}Z_k\lambda| \]

Let’s calculate it in R. First, we define a function to calculate the profile likelihood (excluding the constant part of

loglike_demma <- function(log.lambda, y2, X2, Zk2, eigK.values) {
  lambda <- exp(log.lambda) 

  Tk2 <- cbind(X2, Zk2)
  n <- nrow(Tk2)
  q <- ncol(Zk2)
  q2 <- ncol(Tk2)

  w <- 1 / (eigK$values * lambda + 1)
  y3 <- y2 * sqrt(w)
  t3 <- Tk2 * sqrt(w)
  z3 <- Zk2 * sqrt(w)

  yy <- crossprod(y3)
  yt <- crossprod(y3, t3)
  ty <- t(yt)
  tt <- crossprod(t3)
  yz <- crossprod(y3, z3)
  zy <- t(yz)
  tz <- crossprod(t3, z3)
  zt <- t(tz)
  zz <- crossprod(z3)

#  lambda.Hkinv <- lambda * solve(lambda * zz - diag(q))
  lambda.Hkinv <- lambda / (lambda * drop(zz) - 1)
  yHy <- yy - yz %*% lambda.Hkinv %*% zy
  tHy <- ty - tz %*% lambda.Hkinv %*% zy
  yHt <- t(tHy)
  yHz <- yz - yz %*% lambda.Hkinv %*% zz
  zHy <- t(yHz)
  tHt <- tt - tz %*% lambda.Hkinv %*% zt
  tHz <- tz - tz %*% lambda.Hkinv %*% zz
  zHt <- t(tHz)
  zHz <- zz - zz %*% lambda.Hkinv %*% zz

  ln.detH <- sum(log(eigK.values * lambda + 1))
#  ln.detHk <- ln.detH + log(det(diag(q) - lambda * zz))
  ln.detHk <- ln.detH + log(diag(q) - lambda * drop(zz))
  ln.detTkHkinvTk <- log(det(tHt))
  ln.yPy <- log(yHy - yHt %*% solve(tHt) %*% tHy)
  ll <- - 0.5 * ln.detHk - 0.5 * ln.detTkHkinvTk - 
    0.5 * (n - q2) * ln.yPy
  return(ll)
}

As before, map $y, X, Z_k$ with eigenvectors.

y2 <- t(U) %*% y
X2 <- t(U) %*% X
Zk2 <- t(U) %*% Zk

Find the $\lambda$ that maximizes the profile likelihood using the optim function.

parm <- optim(par = log(0.5), fn = loglike_demma, 
              y2 = y2, X2 = X2, Zk2 = Zk2, eigK.values = eigK$values,
              method="L-BFGS-B", lower = -100, upper = 100,
              control = list(fnscale = -1))
opt.lambda <- exp(parm$par)
opt.lambda

## [1] 0.01028833

Calculate the estimated values.

w <- 1 / (eigK$values * opt.lambda + 1)

y3 <- y2 * sqrt(w)
t3 <- Tk2 * sqrt(w)
z3 <- Zk2 * sqrt(w)

yy <- crossprod(y3)
yt <- crossprod(y3, t3)
ty <- t(yt)
tt <- crossprod(t3)
yz <- crossprod(y3, z3)
zy <- t(yz)
tz <- crossprod(t3, z3)
zt <- t(tz)
zz <- crossprod(z3)

#lambda.Hkinv <- lambda * solve(lambda * zz - diag(q))
lambda.Hkinv <- lambda / (lambda * drop(zz) - 1)
yHy <- yy - yz %*% lambda.Hkinv %*% zy
tHy <- ty - tz %*% lambda.Hkinv %*% zy
yHt <- t(tHy)
tHt <- tt - tz %*% lambda.Hkinv %*% zt

yPy <- yHy - yHt %*% solve(tHt) %*% tHy

Cinv <- solve(tHt)
theta <- Cinv %*% tHy
theta

##          [,1]
##    8.77527022
## Zk 0.06106566

The error variance $\sigma^2$ is estimated as

s2 <- yPy / (n - ncol(X) -  1)
s2 <- drop(s2); s2

## [1] 0.006785645

then the estimation error variance is calculated as

v_theta = Cinv * s2
v_theta

##                         Zk
##    0.846255782 0.002571338
## Zk 0.002571338 0.001126270

Finally, the Wald statistic is computed and tested the siginificance of the k-th marker. \[ W_k^{\rm{DEMMA}} = \frac{(\hat{\gamma}_k^{\rm{DEMMA}})^2}{var(\hat{\gamma}_k^{\rm{DEMMA}})} \]

Let’s perform the test.

Wk_demma <- theta[2]^2 / v_theta[2,2]
Wk_demma

## [1] 3.310943

pk <- 1 - pchisq(Wk_demma, df = 1)
pk

## [1] 0.06882

-log10(pk)

## [1] 1.162285

When the number of markers is large, there is generally no significant difference with the original model. This is because the $k$th marker and the markers adjacent to them are highly correlated.

EMMAX (Efficient Mixed Model Association Studies Expedited)

Excluding the $k$-th marker effect from the above model yields the following model: \[ y = X\beta + \xi + \epsilon \]

Here, the expected value and variance of $y$ are as follows: \[ E(y) = X\beta\\ var(y) = V = K\phi^2 + I\sigma^2 = (K\phi^2/\sigma^2+I)\sigma^2=(K\lambda+I)\sigma^2 \] Note that using the eigenvalue decomposition of $K$, \[ (K\lambda+I)^{-1} = U(D\lambda+I)^{-1}U^T \] where \[ (D\lambda+I)^{-1} = W = diag{(d_j\lambda+1)^{-1}} \] If $\hat{\lambda}$ is the REML estimate of $\lambda$ and this is treated as a known value, the weight matrix is given as a constant diagonal matrix.

Let’s try to calculate the optimal $\lambda$ under this model.

X2 <- t(U) %*% X
parm <- optim(par = log(0.5), fn = loglike_emma, 
              y2 = y2, Tk2 = X2, eigK.values = eigK$values,
              method="L-BFGS-B", lower = -1000, upper = 1000,
              control = list(fnscale = -1))
opt.lambda <- exp(parm$par); opt.lambda

## [1] 0.01018

Let us generalize the error to include the polygenic effect and let $e = \xi + \epsilon$, then GWAS model is \[ y = X\beta + Z_k\gamma_k + e \] For this expression, we map the relevant variables using U, the eigenvalue decomposition of K, and express the above model again as \[ y^* = X^*\beta + Z_k^* \gamma_k + e^* \] where \[ y^* = U^T y\\ X^* = U^TX\\ Z_k^* = U^T Z_k\\ e^* = U^T e \]

Here, the expected value of $y^*$ is, \[ E(y^*) = X^* \beta +Z_k^* \gamma_k\\ \]

And the variance of $y^*$ is, \[ var(y^*) = U^T var(e) U = U^T (K\lambda + I) U \sigma^2 = U^T (UDU^T\lambda + I) U\sigma^2 = (D\lambda + I)\sigma^2 = W^{-1}\sigma^2 \] This can be treated as the error variance of the regression. In other words, it can be viewed as a regression model (a multiple regression model with only fixed effects) where each element of $y^*$ has a different error .

The estimates of regression coefficients can be obtained under the heterogeneous error variances as \[ \begin{bmatrix} \hat{\beta} \\ \hat{\gamma_k} \end{bmatrix} = \begin{bmatrix} X^{*T}WX^* & X^{*T}WZ_k^* \\ Z_k^{*T}WX^* & Z_k^{*T}WZ_k^* \end{bmatrix}^{-1} \begin{bmatrix} X^{*T}Wy^* \\ Z_k^{*T}Wy^* \end{bmatrix} \]

The error variance can be estimated as \[ \hat{\sigma}^2 = \frac{1}{n-q-1} (y^* - X^*\hat{\beta}-Z_k^*\gamma_k)^TW(y^* - X^*\hat{\beta}-Z_k^*\gamma_k) \]

The variances of the regression coefficient estimates are as follows. \[ var \begin{bmatrix} \hat{\beta} \\ \hat{\gamma_k} \end{bmatrix} = \begin{bmatrix} X^{*T}WX^* & X^{*T}WZ_k^* \\ Z_k^{*T}WX^* & Z_k^{*T}WZ_k^* \end{bmatrix}^{-1} \hat{\sigma}^2 = \begin{bmatrix} C_{XX} & C_{XZ} \\ C_{ZX} & C_{ZZ} \end{bmatrix} \hat{\sigma}^2 \]

From the above equation, $var(\hat{\gamma}_k)=C_{ZZ}\hat{\sigma}^2$, so Wald’s statistic for $\gamma_k$ can be obtained as \[ W_k = \frac{\hat{\gamma}_k^2}{var(\hat{\gamma}_k)} = \frac{\hat{\gamma}_k^2}{C_{ZZ}\hat{\sigma}^2} = C_{ZZ}^{-1}\frac{\hat{\gamma}_k^2}{\hat{\sigma}^2} \]

From here, we can actually do the calculations.

Map marker genotype scores with $U$.

Z2 <- t(U) %*% Z

Calculate weights to account for differences in error variance for each element of $y^*$.

w2 <- 1 / (eigK$values * opt.lambda + 1)

Specify the weights and obtain a regression model with the lm function (a function that solves a linear regression model for fixed effects).

model <- lm(y2 ~ X2 + Z2[,1000] - 1, weights = w2)
anova(model)

## Analysis of Variance Table
## 
## Response: y2
##             Df  Sum Sq Mean Sq F value Pr(>F)    
## X2           1 0.62014 0.62014 90.4900 <2e-16 ***
## Z2[, 1000]   1 0.00565 0.00565  0.8247 0.3644    
## Residuals  359 2.46028 0.00685                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Prepare a container of p-values of all markers and perform the same regression analysis for all markers with the weights.

p.emmax <- rep(NA, ncol(Z2))
for(i in 1:ncol(Z2)) {
  model <- lm(y2 ~ X2 + Z2[,i] - 1, weights = w)
  p.emmax[i] <- anova(model)[2,5]
}

Compare $-\log_{10}(p)$ values between normal GWAS and GWAS with EMMAX.

op <- par(mfrow = c(1, 2))
plot(p.emmax, gwas.res$p)
abline(0, 1, col = "green")
plot(-log10(p.emmax), -log10(gwas.res$p))
abline(0, 1, col = "green")

par(op)

EMMAX is particularly useful in models with multiple kernels (variable effects whose covariance is specified by a particular matrix).

The following is a model with $L$ kernels. \[ y = X\beta + \sum_{l=1}^L \xi_l + \epsilon \]

For this model, the variance of $y$ is, \[ var(y) = V = \sum_{l=1}^L K_l \phi_l^2 + I\sigma^2 = \left(\sum_{l=1}^L K_l\lambda_l + I\right)\sigma^2 = (K + I) \sigma^2 \] where \[ K = \sum_{l=1}^L K_l \lambda_l \\ \lambda_l = \phi_l^2 / \sigma^2 \] Multiple kernels can be combined into one kernel using $\lambda_l$ as the weight.

The calculation after combining the kernels into K is the same as in the previous section. Namely, \[ V = (K + I) \sigma^2 = U(D + I) U^T \sigma^2 \]

$$ y = X+ Z_k _k + e\ y^* = X^+ Z_k^ _k + e^*

The weights of the regression are similarly obtained as follows \[ var(y^*) = U^T (K + I) U \sigma^2 = U^T (UDU^T + I) U\sigma^2 = (D + I)\sigma^2\\ W = (D + I)^{-1} = diag\left((d_j + 1)^{-1}\right) \]

Let’s do the actual calculation. Here, we add a kernel for the dominance effect.

Zd <- 1 - abs(Z)
Kd <- tcrossprod(Zd)

Calculations are performed using the emmremlMultiKernel function of the EMMREML package for regression models with multiple kernels.

model <- emmremlMultiKernel(y, X, Zlist = list(diag(n),  diag(n)), Klist = list(K, Kd))

Calculate the weights that combine the kernels.

lambda <- model$weights * model$Vu / model$Ve

Combine the kernel with the calculated weight.

Ku <- K * lambda[1] + Kd * lambda[2]

Eigenvalue decomposition of $K$

eigKu <- eigen(Ku)
U <- eigKu$vectors

Mapping of Variables by Eigenvector Matrix of $K$

y3 <- t(U) %*% y
X3 <- t(U) %*% X
Z3 <- t(U) %*% Z

Weight calculation

w3 <- 1 / (eigK$values * opt.lambda + 1)

Test for the effect of the 1000th marker

model <- lm(y3 ~ X3 + Z3[,1000] - 1, weights = w3)
anova(model)

## Analysis of Variance Table
## 
## Response: y3
##             Df  Sum Sq Mean Sq F value Pr(>F)    
## X3           1 0.62385 0.62385  91.219 <2e-16 ***
## Z3[, 1000]   1 0.00611 0.00611   0.894  0.345    
## Residuals  359 2.45522 0.00684                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Calculate similarly for all markers.

p.emmax.ad <- rep(NA, ncol(Z3))
for(i in 1:ncol(Z3)) {
  model <- lm(y3 ~ X3 + Z3[,i] - 1, weights = w3)
  p.emmax.ad[i] <- anova(model)[2,5]
}

Compare the results with a normal GWAS. The value of $-\log_{10}(p)$ is different compared to the previous comparison because of the inclusion of dominant effects. However, the significant markers do not change significantly. This is probably due to the fact that there are only a small number of heterozygos genotypes in the present data.

op <- par(mfrow = c(1, 2))
plot(p.emmax.ad, gwas.res$p)
abline(0, 1, col = "green")
plot(-log10(p.emmax.ad), -log10(gwas.res$p))
abline(0, 1, col = "green")

par(op)

PCA

The Q+K model is expressed in the following form: \[ y = \mu + Q\alpha + Z_k\gamma_k + \xi + e \]

Principal component analysis may be used to calculate Q. In this section, we will focus on the principal component analysis performed from marker genotype scores.

In principal component analysis, the following eigenvalue decomposition is usually used. The following is a principal component analysis based on the sum-of-products matrix. \[ Z^TZ = V\Lambda V^T \] The principal component scores are also calculated as \[ S = ZV \]

The singular value decomposition of $Z$ is expressed as \[ Z = U\Sigma V^T \] Thus, $Z^TZ$ and $ZZ^T$ can also be expressed as follows. \[ Z^TZ = (V\Sigma^TU^T)(U\Sigma V^T) = V\Sigma^T\Sigma V^T \\ ZZ^T = (U\Sigma V^T)(V\Sigma^TU^T) = U\Sigma\Sigma^T U^T \]

Thus, the principal component scores can also be calculated as \[ S = ZV = U\Sigma V^T V = U\Sigma \]

Thus, a principal component analysis can also be performed based on the eigenvalue decomposition of $ZZ^T$ \[ ZZ^T = U\Lambda_2 U^T \] This is faster when the number of samples is small.

Let’s calculate principal component scores in multiple ways. To reduce the computation time, the number of markers is limited to 1000.

Zs <- scale(Z[,1:1000], center = T, scale = F)

Perform eigenvalue decomposition of $Z^TZ$

eigZtZ <- eigen(crossprod(Zs))

Principal scores calculated by the first method:

s1 <- Zs %*% eigZtZ$vectors
head(s1[,1:3])

##                          [,1]       [,2]       [,3]
## NSFTV80@02d095ba.0  0.4142829   5.627109 -3.9271766
## NSFTV87@64fa0112.0  2.0152473  -7.908744 -0.4507248
## NSFTV96@32a6808e.0  2.2035666 -17.565195 -5.6615092
## NSFTV100@1f10be3d.0 2.9005971 -19.616405 -6.0565363
## NSFTV114@eac19fb8.0 2.3298824 -17.788410 -4.8586702
## NSFTV140@85551f9c.0 4.6840942 -17.635155 -5.3393379

Perform singular value decomposion of $Z$

svdZs <- svd(Zs)

Principal scores calculated by the second method:

s2 <- svdZs$u %*% diag(svdZs$d)
head(s2[,1:3])

##            [,1]      [,2]      [,3]
## [1,] -0.4142829 -5.627109 3.9271766
## [2,] -2.0152473  7.908744 0.4507248
## [3,] -2.2035666 17.565195 5.6615092
## [4,] -2.9005971 19.616405 6.0565363
## [5,] -2.3298824 17.788410 4.8586702
## [6,] -4.6840942 17.635155 5.3393379

Perform eigenvalue decomposition of $ZZ^T$

eigZZt <- eigen(tcrossprod(Zs))

Principal scores calculated by the third method:

d2 <- eigZZt$values
d2[d2 < 0] <- 0
s3 <- eigZZt$vectors %*% diag(sqrt(d2))
head(s3[,1:3])

##            [,1]       [,2]       [,3]
## [1,] -0.4142829   5.627109 -3.9271766
## [2,] -2.0152473  -7.908744 -0.4507248
## [3,] -2.2035666 -17.565195 -5.6615092
## [4,] -2.9005971 -19.616405 -6.0565363
## [5,] -2.3298824 -17.788410 -4.8586702
## [6,] -4.6840942 -17.635155 -5.3393379

Using the latter two methods, let’s try to perform a principal component analysis using all markers.

Zs.all <- scale(Z, center = T, scale = F)

Singular value decomposition of $Z$:

tic()
svdZ <- svd(Zs.all)
toc()

## 8.641 sec elapsed

Eigenvalue decomposition of $ZZ^T$：

tic()
eigZZt <- eigen(tcrossprod(Zs.all))
toc()

## 1.85 sec elapsed

Principal component scores are calculated and visualized.

score <- eigZZt$vectors %*% diag(sqrt(eigZZt$values))
colnames(score) <- paste0("PC", 1:ncol(score))
pairs(score[,1:3])

GWAS based on GBLUP

The GBLUP model commonly used in genomic selection can be expressed as \[ y = X\beta + \xi + e \]

The $\xi$ in the model above can be calculated from the matrix of marker genotype scores and the marker effect $\delta$ as \[ y = X\beta + Z\delta + e \] Then, the covariance of $\xi$ is \[ K \sigma_{\xi}^2 = var(\xi) = var(Z\delta)=Z var(\delta)Z^T \\ = ZZ^T \sigma_{\delta}^2 \]

Here, since $K = ZZ^T$, we know that $\sigma_{\xi}^2 = \sigma_{\delta}^2$.

In this case, the best linear unbiased predictor (BLUP) of the marker effect $\hat{\delta}$ can be calculated as \[ BLUP(\hat{\delta}) = cov(\delta, \xi) [var(\xi)]^{-1} \hat{\xi} \\ = cov(\delta, \delta) Z^T K^{-1} (\sigma_{\xi}^2)^{-1} \hat{\xi} \\ = \left(\frac{\sigma_{\delta}^2}{\sigma_{\delta}^2}\right) Z^T K^{-1} \hat{\xi} = Z^T K^{-1} \hat{\xi} \]

Calculate this using R. Here, we use the lmm.diego function of the gaston package to solve the linear mixed model:

K <- tcrossprod(Z)
eigK <- eigen(K)
model <- lmm.diago(Y = y, X = matrix(1, nrow = length(y)), eigenK = eigK)

## Optimization in interval [8.40963e-07, 1]
## Optimizing with p = 0
## [Iteration 1] Current point = 8.40963e-07 df = 1.63505e+07
## [Iteration 2] Current point = 2.39767e-06 df = 8.13073e+06
## [Iteration 3] Current point = 5.50912e-06 df = 4.14694e+06
## [Iteration 4] Current point = 1.21057e-05 df = 2.10331e+06
## [Iteration 5] Current point = 2.55375e-05 df = 1.02509e+06
## [Iteration 6] Current point = 5.02518e-05 df = 485666
## [Iteration 7] Current point = 9.30922e-05 df = 227376
## [Iteration 8] Current point = 0.000165525 df = 105836
## [Iteration 9] Current point = 0.000286478 df = 49126.8
## [Iteration 10] Current point = 0.000487387 df = 22807.3
## [Iteration 11] Too many iterations, using Brent algorithm
## [Iteration 11] Brent gives 0.0100776

From $\xi$(BLUP_omega), obtain the BLUP for the marker effect.

xi <- model$BLUP_omega
tic()
delta <- t(Z) %*% solve(K) %*% xi
toc()

## 2.027 sec elapsed

It can also be computed using eigenvalue decomposition, as was done above.

tic()
U <- eigK$vectors
w <- 1 / eigK$values
delta2 <- crossprod((t(U) %*% Z) * sqrt(w), (t(U) %*% xi) * sqrt(w))
toc()

## 1.824 sec elapsed

Both are (of course) perfectly consistent.

plot(delta, delta2)
abline(0, 1, col = "green")

The $\xi$ computed from it is also perfectly consistent.

plot(xi, Z %*% delta)
abline(0, 1, col = "green")

The variance of the estimate of $\delta$ can be calculated in relation to the estimate of $\xi$ as follows. \[ var(\hat{\delta}) = var(Z^T K^{-1} \hat{\xi}) = Z^T K^{-1} var(\hat{\xi}) K^{-1} Z \]

Predicted error variance of $\hat{\xi}$ is \[ PEV(\hat{\xi}) = var(\xi - \hat{\xi}) = C^{\xi\xi} = var(\xi) - var(\hat{\xi}) \] then, the variance of $\hat{\xi}$ can be caucluated as \[ var(\hat{\xi}) = var(\xi) - C^{\xi\xi} = K \sigma_{\xi}^2 - C^{\xi\xi} \] where \[ C^{\xi\xi} = \sigma^2 (I - X(X^TX)^{-1}X^T + K^{-1} \sigma^2 / \sigma_{\xi}^2)^{-1} \]

Finally, the variance of $\hat{\delta}$ can be calculated as: \[ Var(\hat{\delta}) = Z^T K^{-1} (K \sigma_{\xi}^2 - C^{\xi\xi}) K^{-1} Z \\ = Z^T K^{-1} Z \sigma_{\xi}^2 - Z^T K^{-1} C^{\xi\xi} K^{-1} Z \]

Actually calculate $C^{\xi\xi}$ with R.

s2 <- model$sigma2
s2xi <- model$tau
Kinv <- solve(K)
Cxx <- s2 * solve(diag(nrow(X)) - X %*% solve(t(X) %*% X) %*% t(X) - Kinv * s2 / s2xi)

If $var(\hat{\delta})$ is calculated properly, the dimension is high and it is very difficult. Calculate only the diagonal part that is needed.

ZtKinv <- t(Z) %*% Kinv
v1 <- apply(ZtKinv * t(Z), 1, sum) * s2xi
v2 <- apply((ZtKinv %*% Cxx) * ZtKinv, 1, sum)
v_delta <- v1 - v2

Test the significance of the effect of the $k$th marker with a Wald test.

Wk <- delta^2 / v_delta
pk <- 1 - pf(Wk, df1 = 1, df2 = n - ncol(X) - 1)
head(-log10(pk))

##                    [,1]
## SNP-1.8562.  0.04766425
## SNP-1.18594. 0.97578589
## SNP-1.24921. 0.32368697
## SNP-1.25253. 0.06146313
## SNP-1.29213. 0.11722743
## SNP-1.30477. 0.18957684

Draw a plot (because x axis does not represent the positions of markers, this is not a Manhattan plot, but looks very similar).

plot(-log10(pk))

To compare the results with the normal GWAS, we conduct the GWAS with the association.test function in the gaston package.

gwas.res <- association.test(bm.wp, Y = y, method = "lmm", eigenK = eigK,
                           response = "quantitative", test = "wald", p = 0)

Draw figures for the comparison.

op <- par(mfrow = c(1, 2))
plot(gwas.res$p, pk)
abline(0, 1, col = "green")
plot(-log10(gwas.res$p), -log10(pk))
abline(0, 1, col = "green")

par(op)

Marker effects are also compared.

plot(gwas.res$beta, delta)
abline(h = 0); abline(v = 0)

GWAS models

岩田洋佳 hiroiwata@g.ecc.u-tokyo.ac.jp

2023-12-05