Simple regression analysis
Statistical model
The model for a single-marker regression analysis can be written as \[
y = X\beta + Z_k \gamma_k + e
\] where \(X\) is the design
matrix of fixed effects (year, study site, etc.), including the overall
mean, \(\beta\) is the vector of fixed-effect coefficients,
\(Z_k\) is the scored genotype (-1, 0,
1) of the \(k\)th marker, \(\gamma_k\) is the effect of marker \(k\), and \(e\)
is the error term, which includes environmental errors and the effects of other
SNPs.
Prepare X
Here we create a design matrix for the overall mean.
n <- length(y)
X <- matrix(1, nrow = n, ncol = 1)
head(X)
## [,1]
## [1,] 1
## [2,] 1
## [3,] 1
## [4,] 1
## [5,] 1
## [6,] 1
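If additional fixed effects such as year or study site were available, \(X\) could be extended with dummy variables. A minimal sketch with a made-up two-level site factor (hypothetical; not part of this dataset):
set.seed(123)
site <- factor(sample(c("A", "B"), n, replace = TRUE)) # hypothetical factor
X.ext <- cbind(X, model.matrix(~ site)[, -1, drop = FALSE]) # append dummy column
head(X.ext)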
Prepare Z
Note that the \(j\)th element of
\(Z_k\) represents the genotype at the
\(k\)th marker of the \(j\)th sample (accession, line, variety,
etc.), as mentioned earlier. \[
Z_{jk} = \left\{
\begin{array}{ll}
-1 & \rm{for}\, A_1 A_1 \\
0 & \rm{for}\, A_1 A_2 \\
1 & \rm{for}\, A_2 A_2
\end{array}
\right.
\]
Z matrix
In bm.wp, genotypes are scored as 0, 1, 2, so we subtract 1 to
convert the scores to -1, 0, 1.
Z <- as.matrix(bm.wp) - 1
head(Z[,1:4])
## SNP-1.8562. SNP-1.18594. SNP-1.24921. SNP-1.25253.
## NSFTV80@02d095ba.0 -1 -1 -1 -1
## NSFTV87@64fa0112.0 -1 -1 -1 -1
## NSFTV96@32a6808e.0 -1 -1 -1 -1
## NSFTV100@1f10be3d.0 -1 -1 -1 -1
## NSFTV114@eac19fb8.0 -1 -1 -1 -1
## NSFTV140@85551f9c.0 -1 -1 -1 -1
Estimates of the parameters
Estimates of the regression coefficients \(\beta\) and \(\gamma_k\) are obtained by the following
matrix calculation. \[
\begin{bmatrix}
\hat{\beta} \\
\hat{\gamma_k}
\end{bmatrix}
=
\begin{bmatrix}
X^TX & X^TZ_k \\
Z_k^TX & Z_k^TZ_k
\end{bmatrix}^{-1}
\begin{bmatrix}
X^Ty \\
Z_k^T y
\end{bmatrix}
\]
Calculation with lm function
Let’s estimate the regression coefficients for the 1000th SNP (k =
1000).
k <- 1000
Zk <- Z[, k]
model <- lm(y ~ Zk)
The simple regression analysis is performed with R's lm function; summary displays the result.
summary(model)
##
## Call:
## lm(formula = y ~ Zk)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8671 -0.4128 -0.1118 0.3309 1.7529
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.80993 0.03563 78.86 < 2e-16 ***
## Zk 0.09587 0.03578 2.68 0.00771 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5495 on 359 degrees of freedom
## Multiple R-squared: 0.01961, Adjusted R-squared: 0.01688
## F-statistic: 7.18 on 1 and 359 DF, p-value: 0.007712
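As a quick consistency check, the squared \(t\) value of Zk equals the \(F\) statistic reported above, since an \(F\) test with one numerator degree of freedom is the square of a \(t\) test:
coef(summary(model))["Zk", "t value"]^2 # 2.68^2 ≈ 7.18, the F statistic above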
C matrix
Let’s obtain the same estimates by direct matrix calculation. First, we prepare the
matrix \(C\): \[
C =
\begin{bmatrix}
X^TX & X^TZ_k \\
Z_k^TX & Z_k^TZ_k
\end{bmatrix}
\]
Calculate the C matrix
Prepare the C matrix.
xtx <- t(X) %*% X
xtz <- t(X) %*% Zk
ztx <- t(Zk) %*% X
ztz <- t(Zk) %*% Zk
C <- matrix(c(xtx, xtz, ztx, ztz), nrow = 2, ncol = 2, byrow = TRUE)
C
## [,1] [,2]
## [1,] 361 -210
## [2,] -210 358
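The same matrix can also be assembled by binding the blocks directly, which generalizes naturally when \(X\) has more than one column:
C2 <- rbind(cbind(xtx, xtz),
            cbind(ztx, ztz))
all.equal(C, C2, check.attributes = FALSE) # should be TRUE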
Inverse of the C matrix
Calculate the inverse of the C matrix.
Cinv <- solve(C)
Cinv
## [,1] [,2]
## [1,] 0.004204938 0.002466584
## [2,] 0.002466584 0.004240175
Calculate regression coefficients with the inverse of C
Compute estimates of the regression coefficients using the inverse
matrix.
xty <- t(X) %*% y
zty <- t(Zk) %*% y
theta <- Cinv %*% matrix(c(xty, zty), nrow = 2, ncol = 1)
theta
## [,1]
## [1,] 2.80993368
## [2,] 0.09587171
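These estimates should match the coefficients returned by lm above:
coef(model) # (Intercept) ≈ 2.80993, Zk ≈ 0.09587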
Error variance
The error variance \(\sigma^2\) can
be estimated as \[
\hat{\sigma^2}=\frac{1}{n-q-1}(y-X\hat{\beta}-Z_k\hat{\gamma}_k)^T(y-X\hat{\beta}-Z_k\hat{\gamma}_k)
\]
Calculation of \(\hat{\sigma^2}\)
To easily calculate \(X\hat{\beta}+Z_k\hat{\gamma}_k\), \(X\) and \(Z_k\) are bundled together as \(T_k\). Then \(X\hat{\beta}+Z_k\hat{\gamma}_k\) can be
calculated as \(T_k \hat{\theta}\).
Tk <- cbind(X, Zk)
Now let’s calculate the error variance.
s2 <- (1 / (n - ncol(X) - 1)) *
t(y - Tk %*% theta) %*% (y - Tk %*% theta)
s2 <- drop(s2); s2
## [1] 0.3019159
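This agrees with the residual variance from the lm fit (the squared residual standard error):
sigma(model)^2 # 0.5495^2 ≈ 0.3019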
Variance of the estimated regression coefficients
The variance of the estimated regression coefficients can be
determined based on this estimate of the error variance. \[
\rm{var}
\begin{bmatrix}
\hat{\beta} \\
\hat{\gamma_k}
\end{bmatrix}
=
\begin{bmatrix}
\rm{var}(\hat{\beta}) & \rm{cov}(\hat{\beta}, \hat{\gamma}_k) \\
\rm{cov}(\hat{\gamma}_k, \hat{\beta}) & \rm{var}(\hat{\gamma}_k)
\end{bmatrix}
=
\begin{bmatrix}
X^TX & X^TZ_k \\
Z_k^TX & Z_k^TZ_k
\end{bmatrix}^{-1} \sigma^2
=
\begin{bmatrix}
C_{XX} & C_{XZ} \\
C_{ZX} & C_{ZZ}
\end{bmatrix} \sigma^2
\]
The actual calculation is
v_theta <- Cinv * s2
v_theta
## [,1] [,2]
## [1,] 0.0012695378 0.0007447009
## [2,] 0.0007447009 0.0012801764
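The square roots of the diagonal elements are the standard errors reported by summary(model):
sqrt(diag(v_theta)) # ≈ 0.03563 and 0.03578, the Std. Error column above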
Significance test of the regression coefficient \(\gamma_k\)
A test for whether the regression coefficient \(\gamma_k\) is zero can be performed using
the following statistic \[
W_k = \frac{\hat{\gamma}_k^2}{\rm{var}(\hat{\gamma_k})}
\] This is called the Wald test.
Actual calculation
Let’s actually calculate the statistic.
gammak <- theta[2]
v_gammak <- v_theta[2,2]
Wk <- gammak^2 / v_gammak
Wk
## [1] 7.179781
Chi-square test of the \(W_k\)
This statistic follows a chi-square distribution with one degree of
freedom under the null hypothesis. Thus, the \(p\)-value can be calculated as \[
p_k = 1 - \rm{Pr}(\chi_1^2 < W_k)
\] Let’s calculate it.
pk <- 1 - pchisq(Wk, df = 1)
pk
## [1] 0.007372972
Calculate the \(-\log_{10}(p)\) value
Note that in GWAS, results are often reported as \(-\log_{10}(p)\):
the common (base-10) logarithm of the \(p\)-value
multiplied by \(-1\).
\[
\eta_k = -\log_{10}(p_k)
\]
The actual calculation is
-log10(pk)
## [1] 2.132357
Precise distribution of \(W_k\)
More precisely, under the null hypothesis \(W_k\) follows an
\(F\) distribution with \(1\) and \(n - q - 1\) degrees of
freedom, where \(q\) denotes the number of fixed-effect covariates
(the number of columns of \(X\); here \(q = 1\)).
pk <- 1 - pf(Wk, df1 = 1, df2 = n - ncol(X) - 1)
pk
## [1] 0.00771248
-log10(pk)
## [1] 2.112806
Comparison with anova and lm functions
The same result can be obtained from an analysis of variance
(an \(F\) test) of the simple
regression model fitted with lm.
model <- lm(y ~ Zk)
anova(model)
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## Zk 1 2.168 2.16769 7.1798 0.007712 **
## Residuals 359 108.388 0.30192
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Calculation with the association.test function in gaston
Let’s run a simple regression analysis on all markers using the
association.test function in gaston.
gwas.res <- association.test(bm.wp, Y = y, method = "lm",
response = "quantitative", test = "wald")
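Assuming the rows of gwas.res follow the marker order in bm.wp, and assuming the returned data frame contains beta, sd, and p columns (these columns are shown for the mixed-model output later), the result for the 1000th SNP can be compared against the hand calculation:
gwas.res[k, c("beta", "sd", "p")] # compare with theta[2], sqrt(v_theta[2,2]), pk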
Heritability
The heritability attributable to this marker's effect alone is \[
h_k^2 = \frac{\sigma^2_{G_k}}{\sigma^2_{G_k}+\sigma^2} =
\frac{\rm{var}(Z_k)\gamma_k^2}{\rm{var}(Z_k)\gamma_k^2+\sigma^2} =
\frac{\sigma_{Z_k}^2\gamma_k^2}{\sigma_{Z_k}^2\gamma_k^2+\sigma^2}
\] where \(\rm{var}(Z_k) =
\sigma_{Z_k}^2\) and \(\sigma^2_{G_k} =
\sigma_{Z_k}^2\gamma_k^2\).
Let’s do the actual calculations.
sgk2 <- var(Zk) * gammak^2
hk2 <- sgk2 / (sgk2 + s2); hk2
## [1] 0.01955386
Manhattan plot
Draw a Manhattan plot based on the simple regression.
manhattan(gwas.res, pch = 20)

Q-Q plot
The -log10(p) values of the significance test are inflated. This can
be checked with a Q-Q plot.
qqplot.pvalues(gwas.res$p, pch = 20)

Most markers have a -log10(p) value greater than expected under the
null hypothesis.
Inflation of -log10(p) with a subpopulation structure
The inflation of the \(-\log_{10}(p)\) values is caused by the
subpopulation structure among the samples; it should be controlled by
accounting for the genetic relationships among samples in the GWAS model.
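One common way to quantify this inflation is the genomic inflation factor \(\lambda_{GC}\): the ratio of the median observed \(\chi^2\) statistic to its expected median under the null. A minimal sketch:
chi2 <- qchisq(gwas.res$p, df = 1, lower.tail = FALSE) # p-values back to chi-square statistics
median(chi2) / qchisq(0.5, df = 1) # ≈ 1 if there is no inflation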
Mixed linear model based on relatedness inferred from marker
genotypes
Statistical model
This model, called the Q+K model (Yu et al. 2006), is
one of the most commonly used GWAS models for quantitative traits. The
model equation is as follows. \[
y = Q\alpha + X\beta + Zu + e
\] Here, \(Q\) consists of
numerical values that express the population structure. For example,
\(Q\) is the probability of belonging
to a subpopulation as estimated by Structure analysis (Pritchard
et al. 2000) or the score of the top principal component of
genome-wide marker data. The \(\alpha\)
represents the effect of each subpopulation on the phenotype and is
estimated as a regression coefficient.
\(X\) represents the marker genotype
of interest and \(\beta\) is its
effect. The \(u\) represents the
polygene effect distributed across the genome, and \(Z\) is the design matrix linking the effect
to each sample. If each sample is measured once, \(Z\) is the identity matrix. The
interpretation of \(e\) is the same as
before.
Expected value and variance of \(y\)
Under this model, the expected value of \(y\) is \[
E(y) = Q\alpha + X\beta
\] and the variance is \[
var(y) = V = ZKZ^T \sigma_u^2 + I\sigma^2
\]
Test statistic for \(\beta\)
The significance of \(\beta\) is tested with the Wald statistic \[
W = \frac{\hat{\beta}^2}{var(\hat{\beta})}
\]
Model rewritten in unified notation
The above model can be rewritten as \[
y = X\beta + Z_k \gamma_k + \xi + \epsilon
\] where \(X\beta\) is the fixed
effect representing the grand mean and/or the subpopulation structure
(the \(Q\alpha\) term of the
previous model), \(Z_k \gamma_k\) is the
effect of the \(k\)th marker (the marker of interest), with \(Z_k\) representing the genotype scores
of that marker, \(\xi\) is the random
effect (the \(Zu\) term of the previous model), and \(\epsilon\) is
the error term.
Variance of the random effect \(\xi\)
We assume that the random effect \(\xi\)
(a vector whose length equals the number of samples) follows a multivariate
normal distribution with mean \(\mathbf{0}\) (a zero vector of the
same length) and variance \[
var(\xi) = K \phi^2
\]
Assume \(\xi\) is the sum of
polygene effects for all markers
This random effect is assumed to arise from the polygenic effects of
all markers, thus \[
\xi = \sum_{k=1}^m Z_k \delta_k
\] where \(Z_k\) is the score
representing the genotype of the \(k\)th marker, and \(\delta_k\) is the polygenic effect of the
\(k\)th marker, which follows a normal distribution \[
\delta_k \sim N(0, \phi^2)
\]
Variance of \(\xi\) under the
assumption
Under this assumption, the variance of \(\xi\) can be calculated as \[
var(\xi) = \sum_{k=1}^m Z_k var(\delta_k) Z_k^T = \sum_{k=1}^m Z_k Z_k^T
\phi^2 = K\phi^2
\] where \[
K = \sum_{k=1}^m Z_k Z_k^T
\]
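This identity says that summing the per-marker outer products \(Z_k Z_k^T\) gives the single matrix product \(ZZ^T\) (computed for the full \(Z\) with tcrossprod in the actual calculations below). It can be verified numerically on a few markers (a small subset is used here only for speed):
Zsub <- Z[, 1:5]
K.sum  <- Reduce(`+`, lapply(1:ncol(Zsub), function(k) tcrossprod(Zsub[, k])))
K.prod <- tcrossprod(Zsub)
all.equal(K.sum, K.prod, check.attributes = FALSE) # should be TRUE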
Expected value and variance of \(y\)
Under the above model, the expected value of \(y\) is \[
E(y) = X\beta + Z_k\gamma_k
\] and the variance of \(y\) is
\[
var(y) = K\phi^2 + I\sigma^2 = (K\lambda + I) \sigma^2
\] where \[
\lambda = \phi^2 / \sigma^2
\]
Combine all fixed effects
Now, collect the variables that generate the fixed effects into a single
matrix \[
T_k = (X \| Z_k)
\] (i.e., \(X\) and \(Z_k\) bound column-wise); the above model is then represented as \[
y = T_k \theta_k + \xi + \epsilon
\] where \[
\theta_k = (\beta^T, \gamma_k)^T
\]
Profile likelihood for \(\lambda\)
The profile likelihood for \(\lambda\) (the likelihood obtained by
profiling out all parameters other than \(\lambda\)) of
the above model is \[
L(\lambda) = -\frac{1}{2} \ln |H| - \frac{1}{2} \ln | T_k^T H^{-1} T_k |
\\- \frac{n-q-1}{2} \ln (y^T P_k y) + const
\] where \[
H = K\lambda + I \\
P_k = H^{-1} - H^{-1} T_k (T_k^T H^{-1} T_k)^{-1} T_k^T H^{-1}
\]
Estimates of the regression parameters
Once the \(\lambda\) that maximizes
the profile likelihood above is found, \(\hat{\theta}\), that is, \(\hat{\beta}\) and \(\hat{\gamma_k}\), are estimated as follows.
\[
\begin{bmatrix}
\hat{\beta} \\
\hat{\gamma_k}
\end{bmatrix}
=
\begin{bmatrix}
X^TH^{-1}X & X^TH^{-1}Z_k \\
Z_k^TH^{-1}X & Z_k^TH^{-1}Z_k
\end{bmatrix}^{-1}
\begin{bmatrix}
X^TH^{-1}y \\
Z_k^TH^{-1}y
\end{bmatrix}
\] The error variance \(\sigma^2\) is estimated as \[
\hat{\sigma}^2 = \frac{1}{n-q-1} (y - X\hat{\beta} -
Z_k\hat{\gamma}_k)^{T} H^{-1} (y - X\hat{\beta} - Z_k\hat{\gamma}_k)
\]
Variance of the estimates of the regression parameters
The variance of the estimates of \(\beta\) and \(\gamma_k\) are \[
var
\begin{bmatrix}
\hat{\beta} \\
\hat{\gamma_k}
\end{bmatrix}
=
\begin{bmatrix}
X^TH^{-1}X & X^TH^{-1}Z_k \\
Z_k^TH^{-1}X & Z_k^TH^{-1}Z_k
\end{bmatrix}^{-1}
\hat{\sigma}^2 \\=
\begin{bmatrix}
var(\hat{\beta}) & cov(\hat{\beta}, \hat{\gamma}_k) \\
cov(\hat{\gamma}_k, \hat{\beta}) & var(\hat{\gamma}_k)
\end{bmatrix}
\]
Statistic for the significance test of a marker effect
The test of significance of a marker effect \(\hat{\gamma}_k\) is performed using the
Wald test based on the estimates and their variances. \[
W_k = \frac{\hat{\gamma}_k^2}{var(\hat{\gamma}_k )}
\]
Efficient calculation with the eigenvalue decomposition of \(K\)
We can efficiently perform the calculations shown above using the
eigenvalue decomposition of \(K\):
\[
K = UDU^T
\]
Using the above equation for \(H\),
we derive, \[
H = K\lambda + I = UDU^T + I = U(D\lambda + I) U^T
\]
Calculation of the log-determinant of \(H\) with the eigenvectors and
eigenvalues
The determinant of a product of matrices is the product of the
determinants of the matrices, i.e., \(|AB|=|A||B|\). Using this equality, we can
calculate the log-determinant of \(H\) as
\[
\ln |H| = \ln |U(D\lambda + I)U^T| = \ln |U||D\lambda + I||U^T| \\
= \ln |U||U^T||D\lambda + I| = \ln |UU^T||D\lambda + I| \\
=\ln |D\lambda + I | = \sum_{j=1}^n \ln (d_j\lambda + 1)
\]
Quadratic form in the profile likelihood
In the profile likelihood \[
L(\lambda) = -\frac{1}{2} \ln |H| - \frac{1}{2} \ln | T_k^T H^{-1} T_k |
- \frac{n-q-1}{2} \ln (y^T P_k y) + const
\] the quadratic form \(y^T P_k y\) can be expanded as \[
y^T(H^{-1} - H^{-1}T_k(T_k^TH^{-1}T_k)^{-1}T_k^TH^{-1})y = \\
y^TH^{-1}y - y^TH^{-1}T_k(T_k^TH^{-1}T_k)^{-1}T_k^TH^{-1}y
\] This can be computed by combining \(y^TH^{-1}y\), \(y^TH^{-1}T_k\), and
\(T_k^TH^{-1}T_k\).
Calculate the profile likelihood with the eigenvectors and
eigenvalues
Here, let \[
y^* = U^Ty\\
T_k^* = U^TT_k
\] then \[
H^{-1} = U(D\lambda + I)^{-1} U^T
\]
Thus, \[
y^TH^{-1}y = y^T U(D\lambda + I)^{-1} U^T y = y^{*T}(D\lambda +
I)^{-1}y^*\\
y^TH^{-1}T_k = y^T U(D\lambda + I)^{-1} U^T T_k = y^{*T}(D\lambda +
I)^{-1}T_k^*\\
T_k^TH^{-1}T_k = T_k^T U(D\lambda + I)^{-1} U^T T_k = T_k^{*T}(D\lambda
+ I)^{-1}T_k^*
\]
Note that since \(D\lambda + I\) is
a diagonal matrix, these can be calculated as follows. \[
a^T H^{-1} b = a^T U(D\lambda + I)^{-1}
U^Tb=a^{*T}(D\lambda+I)^{-1}b^{*} = \sum_{j=1}^n
a_j^{*}b_j^{*}(d_j\lambda + 1)^{-1}
\]
Actual calculations
From here, we will perform some actual calculations based on the
equations given above.
The first step is to compute \(K = ZZ^T\) and its eigenvalue
decomposition.
K <- tcrossprod(Z)
eigK <- eigen(K)
From the eigenvalue decomposition results, take out \(U\) and \(D\). \(U\)
is a matrix consisting of eigenvectors and \(D\) is a matrix with eigenvalues on its
diagonal.
U <- eigK$vectors
D <- diag(eigK$values)
Next, \(X\) and \(Z_k\) are combined as \(T_k\).
Tk <- cbind(X, Zk)
Let’s assume that \(\lambda = 0.5\).
\(H\) can be computed in
two ways: using \(K\) directly, or using \(U\) and \(D\).
lambda <- 0.5
I <- diag(1, nrow(K))
H <- K * lambda + I
H[1:3, 1:3]
## NSFTV80@02d095ba.0 NSFTV87@64fa0112.0 NSFTV96@32a6808e.0
## NSFTV80@02d095ba.0 19331.0 8434.5 8330.5
## NSFTV87@64fa0112.0 8434.5 19326.0 15771.5
## NSFTV96@32a6808e.0 8330.5 15771.5 19328.5
The calculation based on the eigenvalue decomposition of \(K\).
H2 <- U %*% (D * lambda + I) %*% t(U)
H2[1:3, 1:3]
## [,1] [,2] [,3]
## [1,] 19331.0 8434.5 8330.5
## [2,] 8434.5 19326.0 15771.5
## [3,] 8330.5 15771.5 19328.5
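The log-determinant identity derived earlier can also be checked numerically:
sum(log(eigK$values * lambda + 1))       # ln|H| from the eigenvalues
determinant(H, logarithm = TRUE)$modulus # direct log-determinant of H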
Calculation of the inverse of \(H\)
The inverse of \(H\) can likewise
be calculated in two ways: directly, or by using \(U\) and \(D\). Note that the inverse of the diagonal matrix \(D\lambda + I\) is simply the
diagonal matrix with elements \(1/(d_j\lambda + 1)\).
Hinv <- solve(H)
Hinv[1:3, 1:3]
## NSFTV80@02d095ba.0 NSFTV87@64fa0112.0 NSFTV96@32a6808e.0
## NSFTV80@02d095ba.0 1.139665e-04 2.708023e-06 4.262416e-06
## NSFTV87@64fa0112.0 2.708023e-06 4.475645e-04 -5.212194e-05
## NSFTV96@32a6808e.0 4.262416e-06 -5.212194e-05 2.975169e-04
Hinv2 <- U %*% diag(1 / (eigK$values * lambda + 1)) %*% t(U)
Hinv2[1:3, 1:3]
## [,1] [,2] [,3]
## [1,] 1.139665e-04 2.708023e-06 4.262416e-06
## [2,] 2.708023e-06 4.475645e-04 -5.212194e-05
## [3,] 4.262416e-06 -5.212194e-05 2.975169e-04
Map \(y\) and \(T_k\) onto the coordinate system defined by \(U\), obtained from the eigenvalue
decomposition of \(K\).
y2 <- t(U) %*% y
Tk2 <- t(U) %*% Tk
In the mapped space, \(H\) becomes diagonal, so
quadratic forms involving its inverse are easy to compute.
t(y) %*% Hinv %*% y
## [,1]
## [1,] 0.07293662
t(y2) %*% diag(1 / (eigK$values * lambda + 1)) %*% y2
## [,1]
## [1,] 0.07293662
t(y) %*% Hinv %*% Tk
## Zk
## [1,] 0.0004573438 7.281413e-05
t(y2) %*% diag(1 / (eigK$values * lambda + 1)) %*% Tk2
## Zk
## [1,] 0.0004573438 7.281413e-05
t(Tk) %*% Hinv %*% Tk
## Zk
## 1.657128e-04 -9.411137e-05
## Zk -9.411137e-05 3.216068e-02
t(Tk2) %*% diag(1 / (eigK$values * lambda + 1)) %*% Tk2
## Zk
## 1.657128e-04 -9.411137e-05
## Zk -9.411137e-05 3.216068e-02
crossprod function
To further streamline the calculation, use crossprod, a
base R function.
w <- 1 / (eigK$values * lambda + 1)
crossprod(y2 * sqrt(w))
## [,1]
## [1,] 0.07293662
crossprod(y2 * sqrt(w), Tk2 * sqrt(w))
## Zk
## [1,] 0.0004573438 7.281413e-05
crossprod(Tk2 * sqrt(w))
## Zk
## 1.657128e-04 -9.411137e-05
## Zk -9.411137e-05 3.216068e-02
Define a function for calculating the log-likelihood
Based on the results so far, we can write a function to calculate the
log-likelihood (excluding the constant part). The function is parameterized
in terms of \(\log(\lambda)\) so that \(\lambda\) never takes negative values.
loglike_emma <- function(log.lambda, y2, Tk2, eigK.values) {
lambda <- exp(log.lambda) # with this trick, lambda is always positive
n <- nrow(Tk2)
q2 <- ncol(Tk2)
# The number of columns of X is q.
# q2 (= q + 1) is the number of columns of Tk
w <- 1 / (eigK.values * lambda + 1) # use the argument, not the global eigK
yw <- y2 * sqrt(w)
tw <- Tk2 * sqrt(w)
yHy <- crossprod(yw)
yHt <- crossprod(yw, tw)
tHt <- crossprod(tw)
ln.detH <- sum(log(eigK.values * lambda + 1))
ll <- - 0.5 * ln.detH - 0.5 * log(det(tHt)) -
0.5 * (n - q2) * log(yHy - yHt %*% solve(tHt) %*% t(yHt))
return(ll)
}
Use the defined function
We can use this function to compute the log-likelihood (excluding
the constant part) for a specific value of \(\lambda\). For example, when \(\lambda = 0.5\),
loglike_emma(log(0.5), y2, Tk2, eigK$values)
## [,1]
## [1,] -923.4354
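Before optimizing, it can be instructive to evaluate the profile likelihood on a grid of \(\log(\lambda)\) values and inspect its shape; a small sketch:
log.lambda.grid <- seq(-10, 5, length.out = 50)
ll <- sapply(log.lambda.grid, loglike_emma,
             y2 = y2, Tk2 = Tk2, eigK.values = eigK$values)
plot(log.lambda.grid, ll, type = "l",
     xlab = "log(lambda)", ylab = "profile log-likelihood")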
Optimize \(\lambda\) using the optim function
Find the value of \(\lambda\) (log.lambda in
the code) that maximizes the profile likelihood using
the optim function.
parm <- optim(par = log(0.5), fn = loglike_emma,
y2 = y2, Tk2 = Tk2, eigK.values = eigK$values,
method="L-BFGS-B", lower = -100, upper = 100,
control = list(fnscale = -1))
parm
## $par
## [1] -7.656512
##
## $value
## [1] -674.0713
##
## $counts
## function gradient
## 5 5
##
## $convergence
## [1] 0
##
## $message
## [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
Conversion from \(\log(\lambda)\) to \(\lambda\)
Convert the optimized log.lambda value to \(\lambda\).
opt.lambda <- exp(parm$par)
opt.lambda
## [1] 0.0004729541
Estimation of regression coefficients based on the optimized \(\lambda\) value
Next, we obtain estimates of \(\beta\) and \(\gamma_k\) based on the optimized
\(\lambda\) value.
w <- 1 / (eigK$values * opt.lambda + 1)
TkHinvTk <- crossprod(Tk2 * sqrt(w))
TkHinvy <- crossprod(Tk2 * sqrt(w), y2 * sqrt(w))
Cinv <- solve(TkHinvTk)
theta <- Cinv %*% TkHinvy
theta
## [,1]
## 2.91895653
## Zk 0.01283801
Calculation of error variance and the estimation errors of the
parameters
Calculate the error variance.
yHinvy <- crossprod(y2 * sqrt(w))
s2 <- (yHinvy - t(TkHinvy) %*% Cinv %*% TkHinvy) / (n - ncol(X) - 1)
s2 <- drop(s2); s2
## [1] 0.02838054
Calculate the estimation error of the parameters.
v_theta <- Cinv * s2
v_theta
## Zk
## 0.1653170273 0.0007806255
## Zk 0.0007806255 0.0013392777
Wald test based on chi-square
Calculate the Wald statistic and perform a chi-square test.
Wk <- theta[2]^2 / v_theta[2,2]
Wk
## [1] 0.1230622
pk <- 1 - pchisq(Wk, df = 1)
pk
## [1] 0.7257367
-log10(pk)
## [1] 0.1392209
Wald test based on F
The \(F\) test can also be
performed.
pk <- 1 - pf(Wk, df1 = 1, df2 = n - ncol(X) - 1)
pk
## [1] 0.7259424
-log10(pk)
## [1] 0.1390978
GWAS with the association.test in gaston
Run a GWAS using the association.test function in the
gaston package, this time with a mixed linear model (method = "lmm").
gwas.res <- association.test(bm.wp, Y = y,
method = "lmm", eigenK = eigK,
response = "quantitative",
test = "wald", p = 0)
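Assuming again that the rows of gwas.res follow the marker order in bm.wp, the mixed-model result for the 1000th SNP (its beta, sd, and p columns, shown in the output below) can be compared with the hand calculation above:
gwas.res[k, c("beta", "sd", "p")] # compare with theta[2], sqrt(v_theta[2,2]), pk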
Heritability
Note that \[
\lambda = \phi^2 / \sigma^2
\] When we rewrite it with \(\phi^2 =
\sigma_g^2, \sigma^2 = \sigma_e^2\), then \[
\lambda = \sigma_g^2 / \sigma_e^2
\] Also, the heritability due to the polygene effect is, \[
h^2 = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2 }
\] then \[
h^2 = \frac{\lambda}{\lambda + 1}
\]
Comparison with the result of the association.test
function
The hand calculation agrees with the result of the
association.test function of the gaston package.
h2 <- opt.lambda / (opt.lambda + 1)
h2
## [1] 0.0004727305
Manhattan plot
Draw a Manhattan plot with the manhattan function in the
gaston package.
manhattan(gwas.res, pch = 20)

\(p\)-value adjustment (Bonferroni
correction)
Because we test a large number of SNPs, the \(p\)-values must be
adjusted to account for the false positives that arise in
multiple comparisons.
As a first approach, we apply the
Bonferroni correction to the \(p\)-values.
p.adj <- p.adjust(gwas.res$p, method = "bonferroni")
gwas.res[p.adj < 0.05, ]
## chr pos id A1 A2 freqA2 h2 beta
## 5401 2 2679995 SNP-2.2679992. C T 0.05263158 0.0007975582 0.3944404
## 11353 3 16688126 SNP-3.16686770. G A 0.31301939 0.0004212991 -0.1704106
## 11357 3 16719368 SNP-3.16718013. G A 0.25623269 0.0004045499 -0.2284444
## 11358 3 16733441 SNP-3.16732086. G T 0.37257618 0.0003592697 0.2736889
## 17088 5 5327913 SNP-5.5327890. A G 0.38504155 0.0003957726 0.1892406
## 17091 5 5362484 SNP-5.5362461. G A 0.21052632 0.0004161017 -0.2163359
## 17094 5 5371772 SNP-5.5371749. G A 0.47783934 0.0003193898 0.2266789
## 17095 5 5372955 SNP-5.5372932. A G 0.49030471 0.0003376325 0.1996932
## 17103 5 5442823 SNP-5.5442800. G A 0.24515235 0.0004024853 0.2053261
## 19176 5 28533147 SNP-5.28470501. A G 0.20914127 0.0004178987 0.2178054
## 19177 5 28548001 SNP-5.28485355. C T 0.07063712 0.0004071298 0.2844308
## 24574 7 22113010 SNP-7.22112016. A T 0.10941828 0.0003886924 0.2713847
## 24880 7 24845026 SNP-7.24844031. C T 0.09695291 0.0004107192 0.2572480
## 24902 7 25214651 SNP-7.25213656. C T 0.08587258 0.0003848025 0.3318363
## sd p
## 5401 0.07772526 3.878962e-07
## 11353 0.03019296 1.660961e-08
## 11357 0.03800305 1.841405e-09
## 11358 0.03012821 1.045226e-19
## 17088 0.03349821 1.611254e-08
## 17091 0.03632566 2.593449e-09
## 17094 0.02781480 3.651945e-16
## 17095 0.02713206 1.838486e-13
## 17103 0.04058167 4.201884e-07
## 19176 0.04461394 1.050138e-06
## 19177 0.05196176 4.403658e-08
## 24574 0.04749727 1.105522e-08
## 24880 0.05143631 5.694723e-07
## 24902 0.05847978 1.391951e-08
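Equivalently, Bonferroni control at the 0.05 level corresponds to a fixed per-marker threshold of \(0.05/m\), where \(m\) is the number of markers:
m <- nrow(gwas.res)
thr <- -log10(0.05 / m)       # genome-wide -log10(p) threshold
sum(-log10(gwas.res$p) > thr) # should equal the number of markers listed above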
False discovery rate (FDR)
Here, we control the false discovery rate (FDR) using the procedure
proposed by Benjamini and Hochberg (1995).
fdr <- p.adjust(gwas.res$p, method = "BH")
gwas.res[fdr < 0.05, ]
## chr pos id A1 A2 freqA2 h2 beta
## 5401 2 2679995 SNP-2.2679992. C T 0.05263158 0.0007975582 0.3944404
## 7305 2 18649299 SNP-2.18643429. C G 0.07617729 0.0004655045 0.2614543
## 9562 3 1596846 SNP-3.1595841. A G 0.35318560 0.0006248802 0.2393151
## 11353 3 16688126 SNP-3.16686770. G A 0.31301939 0.0004212991 -0.1704106
## 11357 3 16719368 SNP-3.16718013. G A 0.25623269 0.0004045499 -0.2284444
## 11358 3 16733441 SNP-3.16732086. G T 0.37257618 0.0003592697 0.2736889
## 11359 3 16737323 SNP-3.16735968. C T 0.18698061 0.0004467158 -0.1892784
## 17080 5 5249951 SNP-5.5249928. G A 0.40304709 0.0004171054 0.1620522
## 17081 5 5265914 SNP-5.5265891. C T 0.40443213 0.0004185274 0.1619151
## 17087 5 5321857 SNP-5.5321834. G A 0.12188366 0.0004518441 0.2497093
## 17088 5 5327913 SNP-5.5327890. A G 0.38504155 0.0003957726 0.1892406
## 17089 5 5338205 SNP-5.5338182. T C 0.27562327 0.0004219442 0.1838167
## 17091 5 5362484 SNP-5.5362461. G A 0.21052632 0.0004161017 -0.2163359
## 17094 5 5371772 SNP-5.5371749. G A 0.47783934 0.0003193898 0.2266789
## 17095 5 5372955 SNP-5.5372932. A G 0.49030471 0.0003376325 0.1996932
## 17096 5 5375849 SNP-5.5375826. C A 0.08310249 0.0004278811 0.2486248
## 17099 5 5388363 SNP-5.5388340. C T 0.11080332 0.0004072153 -0.1839561
## 17103 5 5442823 SNP-5.5442800. G A 0.24515235 0.0004024853 0.2053261
## 17104 5 5458379 SNP-5.5458356. A T 0.44182825 0.0004322142 0.1606296
## 17109 5 5516194 SNP-5.5516131. G A 0.64404432 0.0004397388 0.1418035
## 19176 5 28533147 SNP-5.28470501. A G 0.20914127 0.0004178987 0.2178054
## 19177 5 28548001 SNP-5.28485355. C T 0.07063712 0.0004071298 0.2844308
## 24569 7 22057564 SNP-7.22056570. A G 0.56786704 0.0004251118 -0.2137732
## 24574 7 22113010 SNP-7.22112016. A T 0.10941828 0.0003886924 0.2713847
## 24599 7 22252173 SNP-7.22251179. A T 0.08725762 0.0004128007 0.2432027
## 24624 7 22529639 SNP-7.22528645. C T 0.09141274 0.0004189755 0.2324821
## 24688 7 23048478 SNP-7.23047484. A G 0.11495845 0.0003836625 0.2146536
## 24697 7 23104178 SNP-7.23103184. C T 0.15096953 0.0004205732 0.2002018
## 24880 7 24845026 SNP-7.24844031. C T 0.09695291 0.0004107192 0.2572480
## 24902 7 25214651 SNP-7.25213656. C T 0.08587258 0.0003848025 0.3318363
## 27987 8 24258192 SNP-8.24255477. G A 0.12326870 0.0004405095 0.1882927
## sd p
## 5401 0.07772526 3.878962e-07
## 7305 0.05920973 1.006688e-05
## 9562 0.05168201 3.647290e-06
## 11353 0.03019296 1.660961e-08
## 11357 0.03800305 1.841405e-09
## 11358 0.03012821 1.045226e-19
## 11359 0.04093086 3.757854e-06
## 17080 0.03374972 1.574135e-06
## 17081 0.03355022 1.392665e-06
## 17087 0.05486661 5.333905e-06
## 17088 0.03349821 1.611254e-08
## 17089 0.04033835 5.192132e-06
## 17091 0.03632566 2.593449e-09
## 17094 0.02781480 3.651945e-16
## 17095 0.02713206 1.838486e-13
## 17096 0.05720543 1.385317e-05
## 17099 0.03988736 3.990238e-06
## 17103 0.04058167 4.201884e-07
## 17104 0.03539888 5.687025e-06
## 17109 0.03093504 4.563572e-06
## 19176 0.04461394 1.050138e-06
## 19177 0.05196176 4.403658e-08
## 24569 0.04594978 3.282149e-06
## 24574 0.04749727 1.105522e-08
## 24599 0.05341419 5.284902e-06
## 24624 0.05347250 1.375759e-05
## 24688 0.04450263 1.411381e-06
## 24697 0.04263682 2.659358e-06
## 24880 0.05143631 5.694723e-07
## 24902 0.05847978 1.391951e-08
## 27987 0.04310488 1.252485e-05
Manhattan plot with two thresholds
Visualize these results with Manhattan plots, coloring the markers that pass each criterion.
col <- rep("black", nrow(gwas.res))
col[gwas.res$chr %% 2 == 0] <- "gray50"
col[p.adj < 0.05] <- "green"
manhattan(gwas.res, pch = 20, col = col, main = "Bonferroni correction")

col <- rep("black", nrow(gwas.res))
col[gwas.res$chr %% 2 == 0] <- "gray50"
col[fdr < 0.05] <- "green"
manhattan(gwas.res, pch = 20, col = col, main = "False discovery rate")

Q-Q plot
Draw a Q-Q plot with the qqplot.pvalues function in
gaston.
qqplot.pvalues(gwas.res$p, pch = 20)
