2024-05-09
This document attempts to explain the GWAS model in 10 minutes, although that is probably not enough time. I encourage you to read this document and work through it yourself.
To run this code, the gaston package is necessary. If you have not installed it yet, install it into your R environment before loading it.
# if you have not installed the package
# install.packages("gaston")
require(gaston)
First, marker genotype data in VCF format (McCouch et al. 2016) are loaded.
bm <- read.vcf("HDRA-G6-4-RDP1-RDP2-NIAS.AGCT_MAF005MIS001NR_imputed.vcf.gz")
## ped stats and snps stats have been set.
## 'p' has been set.
## 'mu' and 'sigma' have been set.
bm # show the summary of the file
## A bed.matrix with 1568 individuals and 38769 markers.
## snps stats are set
## ped stats are set
bm contains SNP genotypes coded as scores 0, 1, and 2.
as.matrix(bm[1:10, 1:4])
##                       SNP-1.8562. SNP-1.18594. SNP-1.24921. SNP-1.25253.
## IRGC121316@c88dcbba.0           0            0            0            0
## IRGC121578@950ba1f2.0           0            0            0            0
## IRGC121491@0b7d031b.0           0            0            0            2
## IRGC121440@d02dfe4e.0           0            0            0            2
## IRGC121896@89688646.0           0            0            0            0
## IRGC121323@57c71a3b.0           0            0            0            0
## IRGC121251@392016d6.0           0            0            0            0
## IRGC121902@f1be6ab1.0           0            0            0            1
## IRGC124483@9a9187e6.0           0            0            0            0
## IRGC121314@6c2404e4.0           0            0            0            0
Next, read in the phenotypic data obtained for the same genetic resource accessions.
# read phenotypic data
pheno <- read.csv("RiceDiversityPheno4GWASGS.csv", row.names = 1)
Here, the seed length-to-width ratio is used as an example trait.
y <- pheno$Seed.length.width.ratio # target trait
names(y) <- rownames(pheno)        # name the elements of y
y <- na.omit(y)                    # remove missing elements
Since phenotypic data are not available for all lines, the SNP data are extracted from bm only for the lines that have phenotypic data. Here, the names given to y are used.
# select samples whose ids are contained in names(y)
bm.wp <- bm[bm@ped$id %in% names(y), ]
Remove monomorphic and low-MAF markers.
bm.wp <- select.snps(bm.wp, maf > 0.01)
bm.wp
## A bed.matrix with 361 individuals and 38752 markers.
## snps stats are set
## ped stats are set
Finally, the phenotype data \(y\) are sorted in the same order as the SNP data bm.wp.
y <- y[bm.wp@ped$id] # reorder the elements of y to match bm.wp@ped$id
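As a quick sanity check (not part of the original analysis, but using only the objects defined above), we can confirm that the phenotype and genotype records are now aligned.

# should be TRUE if y and bm.wp refer to the same samples in the same order
all(names(y) == bm.wp@ped$id)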
The model for a single regression analysis can be written as \[ y = X\beta + Z_k \gamma_k + e \] where \(X\) is the design matrix of fixed effects (year, study site, etc.) including the overall mean, \(\beta\) is the vector of fixed effects, \(Z_k\) is the scored genotype (-1, 0, 1) of the \(k\)th marker, \(\gamma_k\) is the effect of marker \(k\), and \(e\) is the error term, which includes environmental error and the effects of the other SNPs.
Here we create a design matrix for the overall mean.
n <- length(y)
X <- matrix(1, nrow = n, ncol = 1)
head(X)
##      [,1]
## [1,]    1
## [2,]    1
## [3,]    1
## [4,]    1
## [5,]    1
## [6,]    1
Note that the \(j\)th element of \(Z_k\) represents the genotype of the \(k\)th marker in the \(j\)th individual (accession, line, variety, etc.), as mentioned earlier. \[ Z_{jk} = \left\{ \begin{array}{ll} -1 & \rm{for}\, A_1 A_1 \\ 0 & \rm{for}\, A_1 A_2 \\ 1 & \rm{for}\, A_2 A_2 \end{array} \right. \]
In bm.wp, genotypes are scored as 0, 1, and 2, so we subtract 1 from the scores to convert them to -1, 0, and 1.
Z <- as.matrix(bm.wp) - 1
head(Z[, 1:4])
##                     SNP-1.8562. SNP-1.18594. SNP-1.24921. SNP-1.25253.
## NSFTV80@02d095ba.0           -1           -1           -1           -1
## NSFTV87@64fa0112.0           -1           -1           -1           -1
## NSFTV96@32a6808e.0           -1           -1           -1           -1
## NSFTV100@1f10be3d.0          -1           -1           -1           -1
## NSFTV114@eac19fb8.0          -1           -1           -1           -1
## NSFTV140@85551f9c.0          -1           -1           -1           -1
Estimates of the regression coefficients \(\beta\) and \(\gamma_k\) are obtained by the following matrix calculation. \[ \begin{bmatrix} \hat{\beta} \\ \hat{\gamma_k} \end{bmatrix} = \begin{bmatrix} X^TX & X^TZ_k \\ Z_k^TX & Z_k^TZ_k \end{bmatrix}^{-1} \begin{bmatrix} X^Ty \\ Z_k^T y \end{bmatrix} \]
Let’s estimate the regression coefficients for the 1000th SNP (k = 1000).
k <- 1000
Zk <- Z[, k]
model <- lm(y ~ Zk)
The single regression analysis was performed above with R's lm function; let's look at the summary.
summary(model)
## 
## Call:
## lm(formula = y ~ Zk)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8671 -0.4128 -0.1118  0.3309  1.7529 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.80993    0.03563   78.86  < 2e-16 ***
## Zk           0.09587    0.03578    2.68  0.00771 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5495 on 359 degrees of freedom
## Multiple R-squared:  0.01961, Adjusted R-squared:  0.01688 
## F-statistic:  7.18 on 1 and 359 DF,  p-value: 0.007712
Let's obtain the same estimates by direct matrix calculation. First, we prepare the matrix \(C\): \[ C = \begin{bmatrix} X^TX & X^TZ_k \\ Z_k^TX & Z_k^TZ_k \end{bmatrix} \]
Prepare the \(C\) matrix in R.
xtx <- t(X) %*% X
xtz <- t(X) %*% Zk
ztx <- t(Zk) %*% X
ztz <- t(Zk) %*% Zk
C <- matrix(c(xtx, xtz, ztx, ztz), nrow = 2, ncol = 2, byrow = TRUE)
C
##      [,1] [,2]
## [1,]  361 -210
## [2,] -210  358
Calculate the inverse of the \(C\) matrix.
Cinv <- solve(C)
Cinv
##             [,1]        [,2]
## [1,] 0.004204938 0.002466584
## [2,] 0.002466584 0.004240175
Compute estimates of the regression coefficients using the inverse matrix.
xty <- t(X) %*% y
zty <- t(Zk) %*% y
theta <- Cinv %*% matrix(c(xty, zty), nrow = 2, ncol = 1)
theta
##            [,1]
## [1,] 2.80993368
## [2,] 0.09587171
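As a quick check (a one-line comparison, not in the original analysis, assuming the model object fitted with lm above), the matrix-based estimates agree with the coefficients reported by lm.

# column 1: matrix calculation, column 2: lm fit
cbind(theta, coef(model))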
The error variance \(\sigma^2\) can be estimated as \[ \hat{\sigma^2}=\frac{1}{n-q-1}(y-X\hat{\beta}-Z_k\hat{\gamma}_k)^T(y-X\hat{\beta}-Z_k\hat{\gamma}_k) \]
To easily calculate \(X\hat{\beta}+Z_k\hat{\gamma}_k\), \(X\) and \(Z_k\) are bundled together as \(T_k\). Then \(X\hat{\beta}+Z_k\hat{\gamma}_k\) can be calculated as \(T_k \hat{\theta}\).
Tk <- cbind(X, Zk)
Now let’s calculate the error variance.
s2 <- (1 / (n - ncol(X) - 1)) * t(y - Tk %*% theta) %*% (y - Tk %*% theta)
s2 <- drop(s2)
s2
## [1] 0.3019159
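This agrees with the residual standard error reported by summary(model) (a quick check, not part of the original analysis; summary(model)$sigma is the residual standard error, so its square is the error-variance estimate).

summary(model)$sigma^2 # should equal s2 (about 0.3019)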
The variance of the estimated regression coefficients can be determined based on this estimate of the error variance. \[ \rm{var} \begin{bmatrix} \hat{\beta} \\ \hat{\gamma_k} \end{bmatrix} = \begin{bmatrix} \rm{var}(\hat{\beta}) & \rm{cov}(\hat{\beta}, \hat{\gamma}_k) \\ \rm{cov}(\hat{\gamma}_k, \hat{\beta}) & \rm{var}(\hat{\gamma}_k) \end{bmatrix} = \begin{bmatrix} X^TX & X^TZ_k \\ Z_k^TX & Z_k^TZ_k \end{bmatrix}^{-1} \sigma^2 = \begin{bmatrix} C_{XX} & C_{XZ} \\ C_{ZX} & C_{ZZ} \end{bmatrix} \sigma^2 \]
The actual calculation is
v_theta <- Cinv * s2
v_theta
##              [,1]         [,2]
## [1,] 0.0012695378 0.0007447009
## [2,] 0.0007447009 0.0012801764
A test for whether the regression coefficient \(\gamma_k\) is zero can be performed using the following statistic \[ W_k = \frac{\hat{\gamma}_k^2}{\rm{var}(\hat{\gamma_k})} \] This is called the Wald test.
Let’s actually calculate the statistic.
gammak <- theta[2]
v_gammak <- v_theta[2, 2]
Wk <- gammak^2 / v_gammak
Wk
## [1] 7.179781
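As a check (again using the model fitted with lm above; not in the original analysis), \(W_k\) equals the \(F\) statistic reported by lm, since for a single coefficient the Wald statistic is the squared \(t\) value.

summary(model)$fstatistic[1] # should equal Wk (about 7.18)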
This statistic follows a chi-square distribution with one degree of freedom under the null hypothesis. Thus, the \(p\)-value can be calculated as \[ p_k = 1 - \rm{Pr}(\chi_1^2 < W_k) \] Let's calculate it.
pk <- 1 - pchisq(Wk, df = 1)
pk
## [1] 0.007372972
Note that in GWAS, results are often reported as \(-\log_{10}(p)\), obtained by taking the common (base-10) logarithm of the \(p\)-value and multiplying it by \(-1\). \[ \eta_k = -\log_{10}(p_k) \]
The actual calculation is
-log10(pk)
## [1] 2.132357
More precisely, under the null hypothesis \(W_k\) follows an \(F\) distribution with degrees of freedom \(1\) and \(n - q - 1\), where \(q\) denotes the number of columns of \(X\) (here, \(q = 1\) for the overall mean).
pk <- 1 - pf(Wk, df1 = 1, df2 = n - ncol(X) - 1)
pk
## [1] 0.00771248
-log10(pk)
## [1] 2.112806
The same result can be obtained from an analysis of variance (in which an \(F\) test is performed) of the single regression model fitted with lm.
model <- lm(y ~ Zk)
anova(model)
## Analysis of Variance Table
## 
## Response: y
##            Df  Sum Sq Mean Sq F value   Pr(>F)   
## Zk          1   2.168 2.16769  7.1798 0.007712 **
## Residuals 359 108.388 0.30192                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Let’s run a simple regression analysis on all markers using the association.test function in gaston.
gwas.res <- association.test(bm.wp, Y = y, method = "lm",
                             response = "quantitative", test = "wald")
The results of the analysis of the 1000th marker are
gwas.res[1000, ]
##      chr     pos             id A1 A2    freqA2       beta         sd
## 1000   1 6829864 SNP-1.6828863.  C  T 0.2091413 0.09587171 0.03577955
##               p
## 1000 0.00771248
sd is the standard deviation of the estimate \(\hat{\gamma}_k\) (i.e., its standard error). It can be computed as
sqrt(v_theta[2,2])
## [1] 0.03577955
The heritability calculated by this marker effect alone is, \[ h_k^2 = \frac{\sigma^2_{G_k}}{\sigma^2_{G_k}+\sigma^2} = \frac{\rm{var}(Z_k)\gamma_k^2}{\rm{var}(Z_k)\gamma_k^2+\sigma^2} = \frac{\sigma_{Z_k}^2\gamma_k^2}{\sigma_{Z_k}^2\gamma_k^2+\sigma^2} \] where \(\rm{var}(Z_k) = \sigma_{Z_k}^2\) and \(\sigma^2_{G_k} = \sigma_{Z_k}^2\gamma_k^2\).
Let’s do the actual calculations.
sgk2 <- var(Zk) * gammak^2
hk2 <- sgk2 / (sgk2 + s2)
hk2
## [1] 0.01955386
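This value closely matches the Multiple R-squared reported by summary(model) above (a quick comparison, not in the original analysis; the two are not identical because \(R^2\) is computed from sums of squares rather than from this variance decomposition).

summary(model)$r.squared # about 0.0196, close to hk2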
Draw a Manhattan plot based on the simple regression.
manhattan(gwas.res, pch = 20)
The -log10(p) values of the significance test are inflated. This can be checked with a Q-Q plot.
qqplot.pvalues(gwas.res$p, pch = 20)
Most markers have a -log10(p) value greater than expected under the null hypothesis.
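One common way to quantify this inflation (a standard diagnostic, though not computed in the original analysis) is the genomic inflation factor \(\lambda_{GC}\): the ratio of the median observed chi-square statistic to its expected value under the null hypothesis. Values well above 1 indicate inflation. A minimal sketch:

# convert p-values back to 1-df chi-square statistics and
# compare the observed median with the null median
chi2 <- qchisq(gwas.res$p, df = 1, lower.tail = FALSE)
median(chi2) / qchisq(0.5, df = 1) # values much greater than 1 suggest inflation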
The inflation of the \(-\log_{10}(p)\) values is due to the subpopulation structure among the samples and should be controlled by considering the genetic relationships among samples in a GWAS model.
This model, called the Q+K model (Yu et al. 2006), is one of the most commonly used GWAS models for quantitative traits. The model equation is as follows. \[ y = Q\alpha + X\beta + Zu + e \] Here, \(Q\) consists of numerical values that express the population structure. For example, \(Q\) may contain the probabilities of belonging to each subpopulation as estimated by Structure analysis (Pritchard et al. 2000), or the scores of the top principal components of genome-wide marker data. The \(\alpha\) represents the effect of each subpopulation on the phenotype and is estimated as a regression coefficient.
\(X\) represents the marker genotype of interest and \(\beta\) is its effect. The \(u\) represents the polygenic effect distributed across the genome, and \(Z\) is the design matrix that links this effect to each sample. If each sample is measured once, \(Z\) becomes the identity matrix. The interpretation of \(e\) is the same as before.
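As an illustration of how such a \(Q\) matrix can be built from principal components (a minimal sketch, not part of the analysis below; the choice of four components is arbitrary here):

# PCA of the marker score matrix Z defined earlier; the top PC
# scores can serve as the columns of Q (this may take a moment)
pca <- prcomp(Z)
Q <- pca$x[, 1:4] # scores of the first four principal components
head(Q)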
Under this model, the expected value of \(y\) is \[ E(y) = Q\alpha + X\beta \] and the variance is \[ var(y) = V = ZKZ^T \sigma_u^2 + I\sigma^2 \]

## Test statistic for \(\beta\)
\[ W = \frac{\hat{\beta}^2}{var(\hat{\beta})} \]
The above model can be rewritten as \[ y = X\beta + Z_k \gamma_k + \xi + \epsilon \] where \(X\beta\) is the fixed effect representing the grand mean or the subpopulation structure (represented as \(Q\alpha\) in the model above), \(Z_k \gamma_k\) is the effect of the \(k\)th marker (the marker of interest), with \(Z_k\) representing the genotypes of that marker, \(\xi\) is the random effect (represented as \(Zu\) in the previous model), and \(\epsilon\) is the error term.
We assume that the random effect \(\xi\) (a vector whose length is the number of samples) follows a multivariate normal distribution with mean \(\mathbf{0}\) (a zero vector of the same length) and variance \[ var(\xi) = K \phi^2 \]
This random effect is assumed to arise from the polygenic effects of all markers, thus \[ \xi = \sum_{k=1}^m Z_k \delta_k \] where \(Z_k\) is the score representing the genotype of the \(k\)th marker and \(\delta_k\) is the polygenic effect of the \(k\)th marker, which follows a normal distribution \[ \delta_k \sim N(0, \phi^2) \]
Under this assumption, the variance of \(\xi\) can be calculated as \[ var(\xi) = \sum_{k=1}^m Z_k var(\delta_k) Z_k^T = \sum_{k=1}^m Z_k Z_k^T \phi^2 = K\phi^2 \] where \[ K = \sum_{k=1}^m Z_k Z_k^T \]
Under the above model, the expected value of \(y\) is \[ E(y) = X\beta + Z_k\gamma_k \] and the variance of \(y\) is \[ var(y) = K\phi^2 + I\sigma^2 = (K\lambda + I) \sigma^2 \] where \[ \lambda = \phi^2 / \sigma^2 \]
Now, collecting the variables that generate the fixed effects as \[ T_k = \begin{bmatrix} X & Z_k \end{bmatrix} \] the above model can be represented as \[ y = T_k \theta_k + \xi + \epsilon \] where \[ \theta_k = (\beta^T, \gamma_k)^T \]
The profile likelihood for \(\lambda\) (the likelihood of profiling out parameters other than \(\lambda\)) of the above model is, \[ L(\lambda) = -\frac{1}{2} \ln |H| - \frac{1}{2} \ln | T_k^T H^{-1} T_k | \\- \frac{n-q-1}{2} \ln (y^T P_k y) + const \] where \[ H = K\lambda + I \\ P_k = H^{-1} - H^{-1} T_k (T_k^T H^{-1} T_k)^{-1} T_k^T H^{-1} \]
Once the \(\lambda\) that maximizes the profile likelihood above is found, \(\hat{\theta}\), that is, \(\hat{\beta}\) and \(\hat{\gamma_k}\), are estimated as follows. \[ \begin{bmatrix} \hat{\beta} \\ \hat{\gamma_k} \end{bmatrix} = \begin{bmatrix} X^TH^{-1}X & X^TH^{-1}Z_k \\ Z_k^TH^{-1}X & Z_k^TH^{-1}Z_k \end{bmatrix}^{-1} \begin{bmatrix} X^TH^{-1}y \\ Z_k^TH^{-1}y \end{bmatrix} \] The error variance \(\sigma^2\) is estimated as \[ \hat{\sigma}^2 = \frac{1}{n-q-1} (y - X\hat{\beta} - Z_k\hat{\gamma}_k)^{T} H^{-1} (y - X\hat{\beta} - Z_k\hat{\gamma}_k) \]
The variance of the estimates of \(\beta\) and \(\gamma_k\) are \[ var \begin{bmatrix} \hat{\beta} \\ \hat{\gamma_k} \end{bmatrix} = \begin{bmatrix} X^TH^{-1}X & X^TH^{-1}Z_k \\ Z_k^TH^{-1}X & Z_k^TH^{-1}Z_k \end{bmatrix}^{-1} \hat{\sigma}^2 \\= \begin{bmatrix} var(\hat{\beta}) & cov(\hat{\beta}, \hat{\gamma}_k) \\ cov(\hat{\gamma}_k, \hat{\beta}) & var(\hat{\gamma}_k) \end{bmatrix} \]
The test of significance of a marker effect \(\hat{\gamma}_k\) is performed using the Wald test based on the estimates and their variances. \[ W_k = \frac{\hat{\gamma}_k^2}{var(\hat{\gamma}_k )} \]
We can efficiently perform the calculations shown above using the eigenvalue decomposition of \(K\): \[ K = UDU^T \]
Using this decomposition of \(K\), we derive \[ H = K\lambda + I = UDU^T\lambda + I = U(D\lambda + I) U^T \]
The determinant of a product of matrices is the product of their determinants, i.e., \(|AB|=|A||B|\). Using this equality (and the orthogonality of \(U\), i.e., \(UU^T = I\)), we can calculate the log-determinant of \(H\) as \[ \ln |H| = \ln |U(D\lambda + I)U^T| = \ln |U||D\lambda + I||U^T| \\ = \ln |U||U^T||D\lambda + I| = \ln |UU^T||D\lambda + I| \\ =\ln |D\lambda + I | = \sum_{j=1}^n \ln (d_j\lambda + 1) \]
Using these relationships, the quadratic form \(y^TP_ky\) in the profile likelihood \[ L(\lambda) = -\frac{1}{2} \ln |H| - \frac{1}{2} \ln | T_k^T H^{-1} T_k | - \frac{n-q-1}{2} \ln (y^T P_k y) + const \] can be expanded as \[ y^TP_ky = y^T(H^{-1} - H^{-1}T_k(T_k^TH^{-1}T_k)^{-1}T_k^TH^{-1})y \\ = y^TH^{-1}y - y^TH^{-1}T_k(T_k^TH^{-1}T_k)^{-1}T_k^TH^{-1}y \] This can be computed by combining \(y^TH^{-1}y\), \(y^TH^{-1}T_k\), and \(T_k^TH^{-1}T_k\).
Here, let \[ y^* = U^Ty\\ T_k^* = U^TT_k \] then \[ H^{-1} = U(D\lambda + I)^{-1} U^T \]
Thus, \[ y^TH^{-1}y = y^T U(D\lambda + I)^{-1} U^T y = y^{*T}(D\lambda + I)^{-1}y^*\\ y^TH^{-1}T_k = y^T U(D\lambda + I)^{-1} U^T T_k = y^{*T}(D\lambda + I)^{-1}T_k^*\\ T_k^TH^{-1}T_k = T_k^T U(D\lambda + I)^{-1} U^T T_k = T_k^{*T}(D\lambda + I)^{-1}T_k^* \]
Note that since \(D\lambda + I\) is a diagonal matrix, these quantities can be calculated as follows. \[ a^T H^{-1} b = a^T U(D\lambda + I)^{-1} U^Tb=a^{*T}(D\lambda+I)^{-1}b^{*} = \sum_{j=1}^n a_j^{*}b_j^{*}(d_j\lambda + 1)^{-1} \] where \(a^* = U^Ta\) and \(b^* = U^Tb\).
From here, we will perform some actual calculations based on the equations given above.
The first step is to compute \(ZZ^T\) and its eigenvalue decomposition.
K <- tcrossprod(Z)
eigK <- eigen(K)
From the eigenvalue decomposition results, take out \(U\) and \(D\). \(U\) is a matrix consisting of eigenvectors and \(D\) is a matrix with eigenvalues on its diagonal.
U <- eigK$vectors
D <- diag(eigK$values)
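As a quick check (not in the original analysis, using only the objects just created), the decomposition reconstructs \(K\).

# the largest absolute reconstruction error should be near zero
max(abs(K - U %*% D %*% t(U)))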
Next, \(X\) and \(Z_k\) are combined as \(T_k\).
Tk <- cbind(X, Zk)
Let's assume that \(\lambda = 0.5\). The computation of \(H\) is possible in two ways: using \(K\) directly, or using \(U\) and \(D\).
lambda <- 0.5
I <- diag(1, nrow(K))
H <- K * lambda + I
H[1:3, 1:3]
##                    NSFTV80@02d095ba.0 NSFTV87@64fa0112.0 NSFTV96@32a6808e.0
## NSFTV80@02d095ba.0            19331.0             8434.5             8330.5
## NSFTV87@64fa0112.0             8434.5            19326.0            15771.5
## NSFTV96@32a6808e.0             8330.5            15771.5            19328.5
Here is the same calculation based on the eigenvalue decomposition of \(K\).
H2 <- U %*% (D * lambda + I) %*% t(U)
H2[1:3, 1:3]
##         [,1]    [,2]    [,3]
## [1,] 19331.0  8434.5  8330.5
## [2,]  8434.5 19326.0 15771.5
## [3,]  8330.5 15771.5 19328.5
The inverse of \(H\) can likewise be calculated in two ways: directly, by inverting \(H\), or by using \(U\) and \(D\). Note that the inverse of the diagonal matrix \(D\lambda + I\) is simply the diagonal matrix whose elements are the reciprocals \(1/(d_j\lambda + 1)\).
Hinv <- solve(H)
Hinv[1:3, 1:3]
##                    NSFTV80@02d095ba.0 NSFTV87@64fa0112.0 NSFTV96@32a6808e.0
## NSFTV80@02d095ba.0       1.139665e-04       2.708023e-06       4.262416e-06
## NSFTV87@64fa0112.0       2.708023e-06       4.475645e-04      -5.212194e-05
## NSFTV96@32a6808e.0       4.262416e-06      -5.212194e-05       2.975169e-04
Hinv2 <- U %*% diag(1 / (eigK$values * lambda + 1)) %*% t(U)
Hinv2[1:3, 1:3]
##              [,1]          [,2]          [,3]
## [1,] 1.139665e-04  2.708023e-06  4.262416e-06
## [2,] 2.708023e-06  4.475645e-04 -5.212194e-05
## [3,] 4.262416e-06 -5.212194e-05  2.975169e-04
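We can also numerically verify the log-determinant identity \(\ln|H| = \sum_{j=1}^n \ln(d_j\lambda + 1)\) derived above (a quick check, not in the original analysis, using the objects already in scope; determinant() with logarithm = TRUE returns the log-determinant directly).

# the two values should agree
c(determinant(H, logarithm = TRUE)$modulus,
  sum(log(eigK$values * lambda + 1)))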
Map \(y\) and \(T_k\) to the coordinate system \(U\) obtained from the eigenvalue decomposition of \(K\).
y2 <- t(U) %*% y
Tk2 <- t(U) %*% Tk
In the mapped space, the coordinate system is orthogonal and the matrix to be inverted is diagonal, so the inverse can be computed easily.
t(y) %*% Hinv %*% y
##            [,1]
## [1,] 0.07293662
t(y2) %*% diag(1 / (eigK$values * lambda + 1)) %*% y2
##            [,1]
## [1,] 0.07293662
t(y) %*% Hinv %*% Tk
##                             Zk
## [1,] 0.0004573438 7.281413e-05
t(y2) %*% diag(1 / (eigK$values * lambda + 1)) %*% Tk2
##                             Zk
## [1,] 0.0004573438 7.281413e-05
t(Tk) %*% Hinv %*% Tk
##                            Zk
##     1.657128e-04 -9.411137e-05
## Zk -9.411137e-05  3.216068e-02
t(Tk2) %*% diag(1 / (eigK$values * lambda + 1)) %*% Tk2
##                            Zk
##     1.657128e-04 -9.411137e-05
## Zk -9.411137e-05  3.216068e-02
To further streamline the calculation, use crossprod, a built-in R function.
w <- 1 / (eigK$values * lambda + 1)
crossprod(y2 * sqrt(w))
##            [,1]
## [1,] 0.07293662
crossprod(y2 * sqrt(w), Tk2 * sqrt(w))
##                             Zk
## [1,] 0.0004573438 7.281413e-05
crossprod(Tk2 * sqrt(w))
##                            Zk
##     1.657128e-04 -9.411137e-05
## Zk -9.411137e-05  3.216068e-02
Based on the results so far, we can write a function to calculate the log-likelihood (excluding the constant part). The optimization is performed on the logarithm of \(\lambda\) so that \(\lambda\) cannot take negative values.
loglike_emma <- function(log.lambda, y2, Tk2, eigK.values) {
    lambda <- exp(log.lambda) # with this trick, lambda is always positive
    n <- nrow(Tk2)
    q2 <- ncol(Tk2) # the number of columns of X is q;
                    # q2 (= q + 1) is the number of columns of Tk
    w <- 1 / (eigK.values * lambda + 1)
    yw <- y2 * sqrt(w)
    tw <- Tk2 * sqrt(w)
    yHy <- crossprod(yw)
    yHt <- crossprod(yw, tw)
    tHt <- crossprod(tw)
    ln.detH <- sum(log(eigK.values * lambda + 1))
    ll <- -0.5 * ln.detH - 0.5 * log(det(tHt)) -
        0.5 * (n - q2) * log(yHy - yHt %*% solve(tHt) %*% t(yHt))
    return(ll)
}
We can use this function for computing the log likelihood (excluding the constant part) under the specific value of \(\lambda\). For example, when \(\lambda = 0.5\),
loglike_emma(log(0.5), y2, Tk2, eigK$values)
##           [,1]
## [1,] -923.4354
Find the value of \(\lambda\) (log.lambda in the code) that maximizes the profile likelihood using the optim function.
parm <- optim(par = log(0.5), fn = loglike_emma,
              y2 = y2, Tk2 = Tk2, eigK.values = eigK$values,
              method = "L-BFGS-B", lower = -100, upper = 100,
              control = list(fnscale = -1))
parm
## $par
## [1] -7.656512
## 
## $value
## [1] -674.0713
## 
## $counts
## function gradient 
##        5        5 
## 
## $convergence
## [1] 0
## 
## $message
## [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
Convert the optimized log.lambda value to \(\lambda\).
opt.lambda <- exp(parm$par)
opt.lambda
## [1] 0.0004729541
Next, we obtain estimates of \(\beta\) and \(\gamma_k\) based on the optimized value of \(\lambda\).
w <- 1 / (eigK$values * opt.lambda + 1)
TkHinvTk <- crossprod(Tk2 * sqrt(w))
TkHinvy <- crossprod(Tk2 * sqrt(w), y2 * sqrt(w))
Cinv <- solve(TkHinvTk)
theta <- Cinv %*% TkHinvy
theta
##          [,1]
##    2.91895653
## Zk 0.01283801
Calculate the error variance.
yHinvy <- crossprod(y2 * sqrt(w))
s2 <- (yHinvy - t(TkHinvy) %*% Cinv %*% TkHinvy) / (n - ncol(X) - 1)
s2 <- drop(s2)
s2
## [1] 0.02838054
Calculate the variances of the parameter estimates.
v_theta <- Cinv * s2
v_theta
##                            Zk
##    0.1653170273 0.0007806255
## Zk 0.0007806255 0.0013392777
Calculate the Wald statistic and perform a chi-square test.
Wk <- theta[2]^2 / v_theta[2, 2]
Wk
## [1] 0.1230622
pk <- 1 - pchisq(Wk, df = 1)
pk
## [1] 0.7257367
-log10(pk)
## [1] 0.1392209
The \(F\) test can also be performed.
pk <- 1 - pf(Wk, df1 = 1, df2 = n - ncol(X) - 1)
pk
## [1] 0.7259424
-log10(pk)
## [1] 0.1390978
Now run the GWAS with the linear mixed model using the association.test function in the gaston package.
gwas.res <- association.test(bm.wp, Y = y, method = "lmm", eigenK = eigK,
                             response = "quantitative", test = "wald", p = 0)
The result of the 1000th marker is shown.
gwas.res[1000, ]
##      chr     pos             id A1 A2    freqA2           h2       beta
## 1000   1 6829864 SNP-1.6828863.  C  T 0.2091413 0.0004727306 0.01283801
##              sd         p
## 1000 0.03659614 0.7257367
The \(p\)-value is obtained by a chi-square test.
sd is the standard error of \(\hat{\gamma}_k\) (the square root of its estimated variance).
sqrt(v_theta[2,2])
## [1] 0.03659614
Note that \[ \lambda = \phi^2 / \sigma^2 \] When we rewrite it with \(\phi^2 = \sigma_g^2, \sigma^2 = \sigma_e^2\), then \[ \lambda = \sigma_g^2 / \sigma_e^2 \] Also, the heritability due to the polygene effect is, \[ h^2 = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2 } \] then \[ h^2 = \frac{\lambda}{\lambda + 1} \]
A hand calculation agrees with the result of the association.test function of the gaston package.
h2 <- opt.lambda / (opt.lambda + 1)
h2
## [1] 0.0004727305
Draw a Manhattan plot with the manhattan function in the gaston package.
manhattan(gwas.res, pch = 20)
Because we test a large number of SNPs, the p-values should be adjusted to account for false positives arising from multiple comparisons.
To adjust the p-values for multiple comparisons, we first apply the Bonferroni correction.
p.adj <- p.adjust(gwas.res$p, method = "bonferroni")
gwas.res[p.adj < 0.05, ]
##       chr      pos              id A1 A2     freqA2           h2       beta
## 5401    2  2679995  SNP-2.2679992.  C  T 0.05263158 0.0007975582  0.3944404
## 11353   3 16688126 SNP-3.16686770.  G  A 0.31301939 0.0004212991 -0.1704106
## 11357   3 16719368 SNP-3.16718013.  G  A 0.25623269 0.0004045499 -0.2284444
## 11358   3 16733441 SNP-3.16732086.  G  T 0.37257618 0.0003592697  0.2736889
## 17088   5  5327913  SNP-5.5327890.  A  G 0.38504155 0.0003957726  0.1892406
## 17091   5  5362484  SNP-5.5362461.  G  A 0.21052632 0.0004161017 -0.2163359
## 17094   5  5371772  SNP-5.5371749.  G  A 0.47783934 0.0003193898  0.2266789
## 17095   5  5372955  SNP-5.5372932.  A  G 0.49030471 0.0003376325  0.1996932
## 17103   5  5442823  SNP-5.5442800.  G  A 0.24515235 0.0004024853  0.2053261
## 19176   5 28533147 SNP-5.28470501.  A  G 0.20914127 0.0004178987  0.2178054
## 19177   5 28548001 SNP-5.28485355.  C  T 0.07063712 0.0004071298  0.2844308
## 24574   7 22113010 SNP-7.22112016.  A  T 0.10941828 0.0003886924  0.2713847
## 24880   7 24845026 SNP-7.24844031.  C  T 0.09695291 0.0004107192  0.2572480
## 24902   7 25214651 SNP-7.25213656.  C  T 0.08587258 0.0003848025  0.3318363
##               sd            p
## 5401  0.07772526 3.878962e-07
## 11353 0.03019296 1.660961e-08
## 11357 0.03800305 1.841405e-09
## 11358 0.03012821 1.045226e-19
## 17088 0.03349821 1.611254e-08
## 17091 0.03632566 2.593449e-09
## 17094 0.02781480 3.651945e-16
## 17095 0.02713206 1.838486e-13
## 17103 0.04058167 4.201884e-07
## 19176 0.04461394 1.050138e-06
## 19177 0.05196176 4.403658e-08
## 24574 0.04749727 1.105522e-08
## 24880 0.05143631 5.694723e-07
## 24902 0.05847978 1.391951e-08
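Note that the Bonferroni adjustment is equivalent to comparing the raw p-values against the threshold \(0.05/m\), where \(m\) is the number of tests (a quick equivalence check, not in the original analysis).

m <- nrow(gwas.res)
sum(gwas.res$p < 0.05 / m) # same number of markers as selected above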
Next, we control the false discovery rate (FDR) proposed by Benjamini and Hochberg (1995).
fdr <- p.adjust(gwas.res$p, method = "BH")
gwas.res[fdr < 0.05, ]
##       chr      pos              id A1 A2     freqA2           h2       beta
## 5401    2  2679995  SNP-2.2679992.  C  T 0.05263158 0.0007975582  0.3944404
## 7305    2 18649299 SNP-2.18643429.  C  G 0.07617729 0.0004655045  0.2614543
## 9562    3  1596846  SNP-3.1595841.  A  G 0.35318560 0.0006248802  0.2393151
## 11353   3 16688126 SNP-3.16686770.  G  A 0.31301939 0.0004212991 -0.1704106
## 11357   3 16719368 SNP-3.16718013.  G  A 0.25623269 0.0004045499 -0.2284444
## 11358   3 16733441 SNP-3.16732086.  G  T 0.37257618 0.0003592697  0.2736889
## 11359   3 16737323 SNP-3.16735968.  C  T 0.18698061 0.0004467158 -0.1892784
## 17080   5  5249951  SNP-5.5249928.  G  A 0.40304709 0.0004171054  0.1620522
## 17081   5  5265914  SNP-5.5265891.  C  T 0.40443213 0.0004185274  0.1619151
## 17087   5  5321857  SNP-5.5321834.  G  A 0.12188366 0.0004518441  0.2497093
## 17088   5  5327913  SNP-5.5327890.  A  G 0.38504155 0.0003957726  0.1892406
## 17089   5  5338205  SNP-5.5338182.  T  C 0.27562327 0.0004219442  0.1838167
## 17091   5  5362484  SNP-5.5362461.  G  A 0.21052632 0.0004161017 -0.2163359
## 17094   5  5371772  SNP-5.5371749.  G  A 0.47783934 0.0003193898  0.2266789
## 17095   5  5372955  SNP-5.5372932.  A  G 0.49030471 0.0003376325  0.1996932
## 17096   5  5375849  SNP-5.5375826.  C  A 0.08310249 0.0004278811  0.2486248
## 17099   5  5388363  SNP-5.5388340.  C  T 0.11080332 0.0004072153 -0.1839561
## 17103   5  5442823  SNP-5.5442800.  G  A 0.24515235 0.0004024853  0.2053261
## 17104   5  5458379  SNP-5.5458356.  A  T 0.44182825 0.0004322142  0.1606296
## 17109   5  5516194  SNP-5.5516131.  G  A 0.64404432 0.0004397388  0.1418035
## 19176   5 28533147 SNP-5.28470501.  A  G 0.20914127 0.0004178987  0.2178054
## 19177   5 28548001 SNP-5.28485355.  C  T 0.07063712 0.0004071298  0.2844308
## 24569   7 22057564 SNP-7.22056570.  A  G 0.56786704 0.0004251118 -0.2137732
## 24574   7 22113010 SNP-7.22112016.  A  T 0.10941828 0.0003886924  0.2713847
## 24599   7 22252173 SNP-7.22251179.  A  T 0.08725762 0.0004128007  0.2432027
## 24624   7 22529639 SNP-7.22528645.  C  T 0.09141274 0.0004189755  0.2324821
## 24688   7 23048478 SNP-7.23047484.  A  G 0.11495845 0.0003836625  0.2146536
## 24697   7 23104178 SNP-7.23103184.  C  T 0.15096953 0.0004205732  0.2002018
## 24880   7 24845026 SNP-7.24844031.  C  T 0.09695291 0.0004107192  0.2572480
## 24902   7 25214651 SNP-7.25213656.  C  T 0.08587258 0.0003848025  0.3318363
## 27987   8 24258192 SNP-8.24255477.  G  A 0.12326870 0.0004405095  0.1882927
##               sd            p
## 5401  0.07772526 3.878962e-07
## 7305  0.05920973 1.006688e-05
## 9562  0.05168201 3.647290e-06
## 11353 0.03019296 1.660961e-08
## 11357 0.03800305 1.841405e-09
## 11358 0.03012821 1.045226e-19
## 11359 0.04093086 3.757854e-06
## 17080 0.03374972 1.574135e-06
## 17081 0.03355022 1.392665e-06
## 17087 0.05486661 5.333905e-06
## 17088 0.03349821 1.611254e-08
## 17089 0.04033835 5.192132e-06
## 17091 0.03632566 2.593449e-09
## 17094 0.02781480 3.651945e-16
## 17095 0.02713206 1.838486e-13
## 17096 0.05720543 1.385317e-05
## 17099 0.03988736 3.990238e-06
## 17103 0.04058167 4.201884e-07
## 17104 0.03539888 5.687025e-06
## 17109 0.03093504 4.563572e-06
## 19176 0.04461394 1.050138e-06
## 19177 0.05196176 4.403658e-08
## 24569 0.04594978 3.282149e-06
## 24574 0.04749727 1.105522e-08
## 24599 0.05341419 5.284902e-06
## 24624 0.05347250 1.375759e-05
## 24688 0.04450263 1.411381e-06
## 24697 0.04263682 2.659358e-06
## 24880 0.05143631 5.694723e-07
## 24902 0.05847978 1.391951e-08
## 27987 0.04310488 1.252485e-05
Visualize these results with a Manhattan plot.
col <- rep("black", nrow(gwas.res))
col[gwas.res$chr %% 2 == 0] <- "gray50"
col[p.adj < 0.05] <- "green"
manhattan(gwas.res, pch = 20, col = col, main = "Bonferroni correction")
col <- rep("black", nrow(gwas.res))
col[gwas.res$chr %% 2 == 0] <- "gray50"
col[fdr < 0.05] <- "green"
manhattan(gwas.res, pch = 20, col = col, main = "False discovery rate")
Draw a Q-Q plot with the qqplot.pvalues function in gaston.
qqplot.pvalues(gwas.res$p, pch = 20)