Simple regression analysis
Statistical model
The model for a single-marker regression analysis can be written as \[
y = X\beta + Z_k \gamma_k + e
\] where \(X\) is the design
matrix of fixed effects (year, study site, etc.), including the overall
mean, \(\beta\) is the vector of fixed-effect coefficients,
\(Z_k\) is the scored genotype (-1, 0,
1) of the \(k\)th marker, \(\gamma_k\) is the effect of marker \(k\), and \(e\)
is the error term, which includes environmental errors and the effects of other
SNPs.
Prepare X
Here we create a design matrix for the overall mean.
n <- length(y)
X <- matrix(1, nrow = n, ncol = 1)
head(X)
## [,1]
## [1,] 1
## [2,] 1
## [3,] 1
## [4,] 1
## [5,] 1
## [6,] 1
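If additional fixed effects such as year or study site were available, \(X\) could be extended with dummy variables. A minimal sketch with a made-up two-level site factor (hypothetical; not part of this dataset):
set.seed(123)
site <- factor(sample(c("A", "B"), n, replace = TRUE)) # hypothetical factor
X.ext <- cbind(X, model.matrix(~ site)[, -1, drop = FALSE]) # append dummy column
head(X.ext)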
Prepare Z
Note that the \(j\)th element of
\(Z_k\) represents the genotype at the
\(k\)th marker of the \(j\)th sample (accession, line, variety,
etc.), as mentioned earlier. \[
Z_{jk} = \left\{
\begin{array}{ll}
-1 & \rm{for}\, A_1 A_1 \\
0 & \rm{for}\, A_1 A_2 \\
1 & \rm{for}\, A_2 A_2
\end{array}
\right.
\]
Z matrix
In bm.wp, genotypes are scored as 0, 1, 2, so we subtract 1 to
convert the scores to -1, 0, 1.
Z <- as.matrix(bm.wp) - 1
head(Z[,1:4])
## SNP-1.8562. SNP-1.18594. SNP-1.24921. SNP-1.25253.
## NSFTV80@02d095ba.0 -1 -1 -1 -1
## NSFTV87@64fa0112.0 -1 -1 -1 -1
## NSFTV96@32a6808e.0 -1 -1 -1 -1
## NSFTV100@1f10be3d.0 -1 -1 -1 -1
## NSFTV114@eac19fb8.0 -1 -1 -1 -1
## NSFTV140@85551f9c.0 -1 -1 -1 -1
Estimates of the parameters
Estimates of the regression coefficients \(\beta\) and \(\gamma_k\) are obtained by the following
matrix calculation. \[
\begin{bmatrix}
\hat{\beta} \\
\hat{\gamma_k}
\end{bmatrix}
=
\begin{bmatrix}
X^TX & X^TZ_k \\
Z_k^TX & Z_k^TZ_k
\end{bmatrix}^{-1}
\begin{bmatrix}
X^Ty \\
Z_k^T y
\end{bmatrix}
\]
Calculation with lm function
Let’s estimate the regression coefficients for the 1000th SNP (k =
1000).
k <- 1000
Zk <- Z[, k]
model <- lm(y ~ Zk)
The simple regression analysis is performed with R's lm function; summary displays the result.
summary(model)
##
## Call:
## lm(formula = y ~ Zk)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8671 -0.4128 -0.1118 0.3309 1.7529
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.80993 0.03563 78.86 < 2e-16 ***
## Zk 0.09587 0.03578 2.68 0.00771 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5495 on 359 degrees of freedom
## Multiple R-squared: 0.01961, Adjusted R-squared: 0.01688
## F-statistic: 7.18 on 1 and 359 DF, p-value: 0.007712
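As a quick consistency check, the squared \(t\) value of Zk equals the \(F\) statistic reported above, since an \(F\) test with one numerator degree of freedom is the square of a \(t\) test:
coef(summary(model))["Zk", "t value"]^2 # 2.68^2 ≈ 7.18, the F statistic above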
C matrix
Let’s obtain the same estimates by direct matrix calculation. First, we prepare the
matrix \(C\): \[
C =
\begin{bmatrix}
X^TX & X^TZ_k \\
Z_k^TX & Z_k^TZ_k
\end{bmatrix}
\]
Calculate the C matrix
Prepare the C matrix.
xtx <- t(X) %*% X
xtz <- t(X) %*% Zk
ztx <- t(Zk) %*% X
ztz <- t(Zk) %*% Zk
C <- matrix(c(xtx, xtz, ztx, ztz), nrow = 2, ncol = 2, byrow = TRUE)
C
## [,1] [,2]
## [1,] 361 -210
## [2,] -210 358
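The same matrix can also be assembled by binding the blocks directly, which generalizes naturally when \(X\) has more than one column:
C2 <- rbind(cbind(xtx, xtz),
            cbind(ztx, ztz))
all.equal(C, C2, check.attributes = FALSE) # should be TRUE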
Inverse of the C matrix
Calculate the inverse of the C matrix.
Cinv <- solve(C)
Cinv
## [,1] [,2]
## [1,] 0.004204938 0.002466584
## [2,] 0.002466584 0.004240175
Calculate regression coefficients with the inverse of C
Compute estimates of the regression coefficients using the inverse
matrix.
xty <- t(X) %*% y
zty <- t(Zk) %*% y
theta <- Cinv %*% matrix(c(xty, zty), nrow = 2, ncol = 1)
theta
## [,1]
## [1,] 2.80993368
## [2,] 0.09587171
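These estimates should match the coefficients returned by lm above:
coef(model) # (Intercept) ≈ 2.80993, Zk ≈ 0.09587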
Error variance
The error variance \(\sigma^2\) can
be estimated as \[
\hat{\sigma^2}=\frac{1}{n-q-1}(y-X\hat{\beta}-Z_k\hat{\gamma}_k)^T(y-X\hat{\beta}-Z_k\hat{\gamma}_k)
\]
Calculation of \(\hat{\sigma^2}\)
To easily calculate \(X\hat{\beta}+Z_k\hat{\gamma}_k\), \(X\) and \(Z_k\) are bundled together as \(T_k\). Then \(X\hat{\beta}+Z_k\hat{\gamma}_k\) can be
calculated as \(T_k \hat{\theta}\).
Tk <- cbind(X, Zk)
Now let’s calculate the error variance.
s2 <- (1 / (n - ncol(X) - 1)) *
t(y - Tk %*% theta) %*% (y - Tk %*% theta)
s2 <- drop(s2); s2
## [1] 0.3019159
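This agrees with the residual variance from the lm fit (the squared residual standard error):
sigma(model)^2 # 0.5495^2 ≈ 0.3019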
Variance of the estimated regression coefficients
The variance of the estimated regression coefficients can be
determined based on this estimate of the error variance. \[
\rm{var}
\begin{bmatrix}
\hat{\beta} \\
\hat{\gamma_k}
\end{bmatrix}
=
\begin{bmatrix}
\rm{var}(\hat{\beta}) & \rm{cov}(\hat{\beta}, \hat{\gamma}_k) \\
\rm{cov}(\hat{\gamma}_k, \hat{\beta}) & \rm{var}(\hat{\gamma}_k)
\end{bmatrix}
=
\begin{bmatrix}
X^TX & X^TZ_k \\
Z_k^TX & Z_k^TZ_k
\end{bmatrix}^{-1} \sigma^2
=
\begin{bmatrix}
C_{XX} & C_{XZ} \\
C_{ZX} & C_{ZZ}
\end{bmatrix} \sigma^2
\]
The actual calculation is
v_theta <- Cinv * s2
v_theta
## [,1] [,2]
## [1,] 0.0012695378 0.0007447009
## [2,] 0.0007447009 0.0012801764
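The square roots of the diagonal elements are the standard errors reported by summary(model):
sqrt(diag(v_theta)) # ≈ 0.03563 and 0.03578, the Std. Error column above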
Significance test of the regression coefficient \(\gamma_k\)
A test for whether the regression coefficient \(\gamma_k\) is zero can be performed using
the following statistic \[
W_k = \frac{\hat{\gamma}_k^2}{\rm{var}(\hat{\gamma_k})}
\] This is called the Wald test.
Actual calculation
Let’s actually calculate the statistic.
gammak <- theta[2]
v_gammak <- v_theta[2,2]
Wk <- gammak^2 / v_gammak
Wk
## [1] 7.179781
Chi-square test of the \(W_k\)
This statistic follows a chi-square distribution with one degree of
freedom under the null hypothesis. Thus, the \(p\)-value can be calculated as \[
p_k = 1 - \rm{Pr}(\chi_1^2 < W_k)
\] Let’s calculate it.
pk <- 1 - pchisq(Wk, df = 1)
pk
## [1] 0.007372972
Calculate the \(-\log_{10}(p)\) value
Note that in GWAS, results are often reported as \(-\log_{10}(p)\):
the common (base-10) logarithm of the \(p\)-value
multiplied by \(-1\).
\[
\eta_k = -\log_{10}(p_k)
\]
The actual calculation is
-log10(pk)
## [1] 2.132357
Precise distribution of \(W_k\)
More precisely, under the null hypothesis \(W_k\) follows an
\(F\) distribution with \(1\) and \(n - q - 1\) degrees of
freedom, where \(q\) denotes the number of fixed-effect covariates
(the number of columns of \(X\); here \(q = 1\)).
pk <- 1 - pf(Wk, df1 = 1, df2 = n - ncol(X) - 1)
pk
## [1] 0.00771248
-log10(pk)
## [1] 2.112806
Comparison with anova and lm functions
The same result can be obtained from an analysis of variance
(an \(F\) test) of the simple
regression model fitted with lm.
model <- lm(y ~ Zk)
anova(model)
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## Zk 1 2.168 2.16769 7.1798 0.007712 **
## Residuals 359 108.388 0.30192
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Calculation with the association.test function in gaston
Let’s run a simple regression analysis on all markers using the
association.test function in gaston.
gwas.res <- association.test(bm.wp, Y = y, method = "lm",
response = "quantitative", test = "wald")
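Assuming the rows of gwas.res follow the marker order in bm.wp, and assuming the returned data frame contains beta, sd, and p columns (these columns are shown for the mixed-model output later), the result for the 1000th SNP can be compared against the hand calculation:
gwas.res[k, c("beta", "sd", "p")] # compare with theta[2], sqrt(v_theta[2,2]), pk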
Heritability
The heritability attributable to this marker's effect alone is \[
h_k^2 = \frac{\sigma^2_{G_k}}{\sigma^2_{G_k}+\sigma^2} =
\frac{\rm{var}(Z_k)\gamma_k^2}{\rm{var}(Z_k)\gamma_k^2+\sigma^2} =
\frac{\sigma_{Z_k}^2\gamma_k^2}{\sigma_{Z_k}^2\gamma_k^2+\sigma^2}
\] where \(\rm{var}(Z_k) =
\sigma_{Z_k}^2\) and \(\sigma^2_{G_k} =
\sigma_{Z_k}^2\gamma_k^2\).
Let’s do the actual calculations.
sgk2 <- var(Zk) * gammak^2
hk2 <- sgk2 / (sgk2 + s2); hk2
## [1] 0.01955386
Manhattan plot
Draw a Manhattan plot based on the simple regression.
manhattan(gwas.res, pch = 20)

Q-Q plot
The -log10(p) values of the significance test are inflated. This can
be checked with a Q-Q plot.
qqplot.pvalues(gwas.res$p, pch = 20)

Most markers have a -log10(p) value greater than expected under the
null hypothesis.
Inflation of -log10(p) with a subpopulation structure
The inflation of the \(-\log_{10}(p)\) values is caused by the
subpopulation structure among the samples; it should be controlled by
accounting for the genetic relationships among samples in the GWAS model.
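One common way to quantify this inflation is the genomic inflation factor \(\lambda_{GC}\): the ratio of the median observed \(\chi^2\) statistic to its expected median under the null. A minimal sketch:
chi2 <- qchisq(gwas.res$p, df = 1, lower.tail = FALSE) # p-values back to chi-square statistics
median(chi2) / qchisq(0.5, df = 1) # ≈ 1 if there is no inflation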
Mixed linear model based on relatedness inferred from marker
genotypes
Statistical model
This model, called the Q+K model (Yu et al. 2006), is
one of the most commonly used GWAS models for quantitative traits. The
model equation is as follows. \[
y = Q\alpha + X\beta + Zu + e
\] Here, \(Q\) consists of
numerical values that express the population structure. For example,
\(Q\) is the probability of belonging
to a subpopulation as estimated by Structure analysis (Pritchard
et al. 2000) or the score of the top principal component of
genome-wide marker data. The \(\alpha\)
represents the effect of each subpopulation on the phenotype and is
estimated as a regression coefficient.
\(X\) represents the marker genotype
of interest and \(\beta\) is its
effect. The \(u\) represents the
polygene effect distributed across the genome, and \(Z\) is the design matrix linking the effect
to each sample. If each sample is measured once, \(Z\) is the identity matrix. The
interpretation of \(e\) is the same as
before.
Expected value and variance of \(y\)
Under this model, the expected value of \(y\) is \[
E(y) = Q\alpha + X\beta
\] and the variance is \[
var(y) = V = ZKZ^T \sigma_u^2 + I\sigma^2
\]
Test statistic for \(\beta\)
The significance of \(\beta\) is tested with the Wald statistic \[
W = \frac{\hat{\beta}^2}{var(\hat{\beta})}
\]
Model rewritten in unified notation
The above model can be rewritten as \[
y = X\beta + Z_k \gamma_k + \xi + \epsilon
\] where \(X\beta\) is the fixed
effect representing the grand mean and/or the subpopulation structure
(the \(Q\alpha\) term of the
previous model), \(Z_k \gamma_k\) is the
effect of the \(k\)th marker (the marker of interest), with \(Z_k\) representing the genotype scores
of that marker, \(\xi\) is the random
effect (the \(Zu\) term of the previous model), and \(\epsilon\) is
the error term.
Variance of the random effect \(\xi\)
We assume that the random effect \(\xi\)
(a vector whose length equals the number of samples) follows a multivariate
normal distribution with mean \(\mathbf{0}\) (a zero vector of the
same length) and variance \[
var(\xi) = K \phi^2
\]
Assume \(\xi\) is the sum of
polygene effects for all markers
This random effect is assumed to arise from the polygenic effects of
all markers, thus \[
\xi = \sum_{k=1}^m Z_k \delta_k
\] where \(Z_k\) is the score
representing the genotype of the \(k\)th marker, and \(\delta_k\) is the polygenic effect of the
\(k\)th marker, which follows a normal distribution \[
\delta_k \sim N(0, \phi^2)
\]
Variance of \(\xi\) under the
assumption
Under this assumption, the variance of \(\xi\) can be calculated as \[
var(\xi) = \sum_{k=1}^m Z_k var(\delta_k) Z_k^T = \sum_{k=1}^m Z_k Z_k^T
\phi^2 = K\phi^2
\] where \[
K = \sum_{k=1}^m Z_k Z_k^T
\]
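This identity says that summing the per-marker outer products \(Z_k Z_k^T\) gives the single matrix product \(ZZ^T\) (computed for the full \(Z\) with tcrossprod in the actual calculations below). It can be verified numerically on a few markers (a small subset is used here only for speed):
Zsub <- Z[, 1:5]
K.sum  <- Reduce(`+`, lapply(1:ncol(Zsub), function(k) tcrossprod(Zsub[, k])))
K.prod <- tcrossprod(Zsub)
all.equal(K.sum, K.prod, check.attributes = FALSE) # should be TRUE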
Expected value and variance of \(y\)
Under the above model, the expected value of \(y\) is \[
E(y) = X\beta + Z_k\gamma_k
\] and the variance of \(y\) is
\[
var(y) = K\phi^2 + I\sigma^2 = (K\lambda + I) \sigma^2
\] where \[
\lambda = \phi^2 / \sigma^2
\]
Combine all fixed effects
Now, collect the variables that generate the fixed effects into a single
matrix \[
T_k = (X \| Z_k)
\] (i.e., \(X\) and \(Z_k\) bound column-wise); the above model is then represented as \[
y = T_k \theta_k + \xi + \epsilon
\] where \[
\theta_k = (\beta^T, \gamma_k)^T
\]
Profile likelihood for \(\lambda\)
The profile likelihood for \(\lambda\) (the likelihood obtained by
profiling out all parameters other than \(\lambda\)) of
the above model is \[
L(\lambda) = -\frac{1}{2} \ln |H| - \frac{1}{2} \ln | T_k^T H^{-1} T_k |
\\- \frac{n-q-1}{2} \ln (y^T P_k y) + const
\] where \[
H = K\lambda + I \\
P_k = H^{-1} - H^{-1} T_k (T_k^T H^{-1} T_k)^{-1} T_k^T H^{-1}
\]
Estimates of the regression parameters
Once the \(\lambda\) that maximizes
the profile likelihood above is found, \(\hat{\theta}\), that is, \(\hat{\beta}\) and \(\hat{\gamma_k}\), are estimated as follows.
\[
\begin{bmatrix}
\hat{\beta} \\
\hat{\gamma_k}
\end{bmatrix}
=
\begin{bmatrix}
X^TH^{-1}X & X^TH^{-1}Z_k \\
Z_k^TH^{-1}X & Z_k^TH^{-1}Z_k
\end{bmatrix}^{-1}
\begin{bmatrix}
X^TH^{-1}y \\
Z_k^TH^{-1}y
\end{bmatrix}
\] The error variance \(\sigma^2\) is estimated as \[
\hat{\sigma}^2 = \frac{1}{n-q-1} (y - X\hat{\beta} -
Z_k\hat{\gamma}_k)^{T} H^{-1} (y - X\hat{\beta} - Z_k\hat{\gamma}_k)
\]
Variance of the estimates of the regression parameters
The variance of the estimates of \(\beta\) and \(\gamma_k\) are \[
var
\begin{bmatrix}
\hat{\beta} \\
\hat{\gamma_k}
\end{bmatrix}
=
\begin{bmatrix}
X^TH^{-1}X & X^TH^{-1}Z_k \\
Z_k^TH^{-1}X & Z_k^TH^{-1}Z_k
\end{bmatrix}^{-1}
\hat{\sigma}^2 \\=
\begin{bmatrix}
var(\hat{\beta}) & cov(\hat{\beta}, \hat{\gamma}_k) \\
cov(\hat{\gamma}_k, \hat{\beta}) & var(\hat{\gamma}_k)
\end{bmatrix}
\]
Statistic for the significance test of a marker effect
The test of significance of a marker effect \(\hat{\gamma}_k\) is performed using the
Wald test based on the estimates and their variances. \[
W_k = \frac{\hat{\gamma}_k^2}{var(\hat{\gamma}_k )}
\]
Efficient calculation with the eigenvalue decomposition of \(K\)
We can efficiently perform the calculations shown above using the
eigenvalue decomposition of \(K\):
\[
K = UDU^T
\]
Using the above equation for \(H\),
we derive, \[
H = K\lambda + I = UDU^T + I = U(D\lambda + I) U^T
\]
Calculation of the log-determinant of \(H\) with the eigenvectors and
eigenvalues
The determinant of a product of matrices is the product of the
determinants of the matrices, i.e., \(|AB|=|A||B|\). Using this equality, we can
calculate the log-determinant of \(H\) as
\[
\ln |H| = \ln |U(D\lambda + I)U^T| = \ln |U||D\lambda + I||U^T| \\
= \ln |U||U^T||D\lambda + I| = \ln |UU^T||D\lambda + I| \\
=\ln |D\lambda + I | = \sum_{j=1}^n \ln (d_j\lambda + 1)
\]
Quadratic form in the profile likelihood
In the profile likelihood \[
L(\lambda) = -\frac{1}{2} \ln |H| - \frac{1}{2} \ln | T_k^T H^{-1} T_k |
- \frac{n-q-1}{2} \ln (y^T P_k y) + const
\] the quadratic form \(y^T P_k y\) can be expanded as \[
y^T(H^{-1} - H^{-1}T_k(T_k^TH^{-1}T_k)^{-1}T_k^TH^{-1})y = \\
y^TH^{-1}y - y^TH^{-1}T_k(T_k^TH^{-1}T_k)^{-1}T_k^TH^{-1}y
\] This can be computed by combining \(y^TH^{-1}y\), \(y^TH^{-1}T_k\), and
\(T_k^TH^{-1}T_k\).
Calculate the profile likelihood with the eigenvectors and
eigenvalues
Here, let \[
y^* = U^Ty\\
T_k^* = U^TT_k
\] then \[
H^{-1} = U(D\lambda + I)^{-1} U^T
\]
Thus, \[
y^TH^{-1}y = y^T U(D\lambda + I)^{-1} U^T y = y^{*T}(D\lambda +
I)^{-1}y^*\\
y^TH^{-1}T_k = y^T U(D\lambda + I)^{-1} U^T T_k = y^{*T}(D\lambda +
I)^{-1}T_k^*\\
T_k^TH^{-1}T_k = T_k^T U(D\lambda + I)^{-1} U^T T_k = T_k^{*T}(D\lambda
+ I)^{-1}T_k^*
\]
Note that since \(D\lambda + I\) is
a diagonal matrix, these can be calculated as follows. \[
a^T H^{-1} b = a^T U(D\lambda + I)^{-1}
U^Tb=a^{*T}(D\lambda+I)^{-1}b^{*} = \sum_{j=1}^n
a_j^{*}b_j^{*}(d_j\lambda + 1)^{-1}
\]
Actual calculations
From here, we will perform some actual calculations based on the
equations given above.
The first step is to compute \(K = ZZ^T\) and its eigenvalue
decomposition.
K <- tcrossprod(Z)
eigK <- eigen(K)
From the eigenvalue decomposition results, take out \(U\) and \(D\). \(U\)
is a matrix consisting of eigenvectors and \(D\) is a matrix with eigenvalues on its
diagonal.
U <- eigK$vectors
D <- diag(eigK$values)
Next, \(X\) and \(Z_k\) are combined as \(T_k\).
Tk <- cbind(X, Zk)
Let’s assume that \(\lambda = 0.5\).
\(H\) can be computed in
two ways: using \(K\) directly, or using \(U\) and \(D\).
lambda <- 0.5
I <- diag(1, nrow(K))
H <- K * lambda + I
H[1:3, 1:3]
## NSFTV80@02d095ba.0 NSFTV87@64fa0112.0 NSFTV96@32a6808e.0
## NSFTV80@02d095ba.0 19331.0 8434.5 8330.5
## NSFTV87@64fa0112.0 8434.5 19326.0 15771.5
## NSFTV96@32a6808e.0 8330.5 15771.5 19328.5
The calculation based on the eigenvalue decomposition of \(K\).
H2 <- U %*% (D * lambda + I) %*% t(U)
H2[1:3, 1:3]
## [,1] [,2] [,3]
## [1,] 19331.0 8434.5 8330.5
## [2,] 8434.5 19326.0 15771.5
## [3,] 8330.5 15771.5 19328.5
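The log-determinant identity derived earlier can also be checked numerically:
sum(log(eigK$values * lambda + 1))       # ln|H| from the eigenvalues
determinant(H, logarithm = TRUE)$modulus # direct log-determinant of H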
Calculation of the inverse of \(H\)
The inverse of \(H\) can likewise
be calculated in two ways: directly, or by using \(U\) and \(D\). Note that the inverse of the diagonal matrix \(D\lambda + I\) is simply the
diagonal matrix with elements \(1/(d_j\lambda + 1)\).
Hinv <- solve(H)
Hinv[1:3, 1:3]
## NSFTV80@02d095ba.0 NSFTV87@64fa0112.0 NSFTV96@32a6808e.0
## NSFTV80@02d095ba.0 1.139665e-04 2.708023e-06 4.262416e-06
## NSFTV87@64fa0112.0 2.708023e-06 4.475645e-04 -5.212194e-05
## NSFTV96@32a6808e.0 4.262416e-06 -5.212194e-05 2.975169e-04
Hinv2 <- U %*% diag(1 / (eigK$values * lambda + 1)) %*% t(U)
Hinv2[1:3, 1:3]
## [,1] [,2] [,3]
## [1,] 1.139665e-04 2.708023e-06 4.262416e-06
## [2,] 2.708023e-06 4.475645e-04 -5.212194e-05
## [3,] 4.262416e-06 -5.212194e-05 2.975169e-04
Map \(y\) and \(T_k\) onto the coordinate system defined by \(U\), obtained from the eigenvalue
decomposition of \(K\).
y2 <- t(U) %*% y
Tk2 <- t(U) %*% Tk
In the mapped space, \(H\) becomes diagonal, so
quadratic forms involving its inverse are easy to compute.
t(y) %*% Hinv %*% y
## [,1]
## [1,] 0.07293662
t(y2) %*% diag(1 / (eigK$values * lambda + 1)) %*% y2
## [,1]
## [1,] 0.07293662
t(y) %*% Hinv %*% Tk
## Zk
## [1,] 0.0004573438 7.281413e-05
t(y2) %*% diag(1 / (eigK$values * lambda + 1)) %*% Tk2
## Zk
## [1,] 0.0004573438 7.281413e-05
t(Tk) %*% Hinv %*% Tk
## Zk
## 1.657128e-04 -9.411137e-05
## Zk -9.411137e-05 3.216068e-02
t(Tk2) %*% diag(1 / (eigK$values * lambda + 1)) %*% Tk2
## Zk
## 1.657128e-04 -9.411137e-05
## Zk -9.411137e-05 3.216068e-02
crossprod function
To further streamline the calculation, use crossprod, a
base R function.
w <- 1 / (eigK$values * lambda + 1)
crossprod(y2 * sqrt(w))
## [,1]
## [1,] 0.07293662
crossprod(y2 * sqrt(w), Tk2 * sqrt(w))
## Zk
## [1,] 0.0004573438 7.281413e-05
crossprod(Tk2 * sqrt(w))
## Zk
## 1.657128e-04 -9.411137e-05
## Zk -9.411137e-05 3.216068e-02
Define a function for calculating the log-likelihood
Based on the results so far, we can write a function to calculate the
log-likelihood (excluding the constant part). The function is parameterized
in terms of \(\log(\lambda)\) so that \(\lambda\) never takes negative values.
loglike_emma <- function(log.lambda, y2, Tk2, eigK.values) {
lambda <- exp(log.lambda) # with this trick, lambda is always positive
n <- nrow(Tk2)
q2 <- ncol(Tk2)
# The number of columns of X is q.
# q2 (= q + 1) is the number of columns of Tk
w <- 1 / (eigK.values * lambda + 1) # use the argument, not the global eigK
yw <- y2 * sqrt(w)
tw <- Tk2 * sqrt(w)
yHy <- crossprod(yw)
yHt <- crossprod(yw, tw)
tHt <- crossprod(tw)
ln.detH <- sum(log(eigK.values * lambda + 1))
ll <- - 0.5 * ln.detH - 0.5 * log(det(tHt)) -
0.5 * (n - q2) * log(yHy - yHt %*% solve(tHt) %*% t(yHt))
return(ll)
}
Use the defined function
We can use this function to compute the log-likelihood (excluding
the constant part) for a specific value of \(\lambda\). For example, when \(\lambda = 0.5\),
loglike_emma(log(0.5), y2, Tk2, eigK$values)
## [,1]
## [1,] -923.4354
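Before optimizing, it can be instructive to evaluate the profile likelihood on a grid of \(\log(\lambda)\) values and inspect its shape; a small sketch:
log.lambda.grid <- seq(-10, 5, length.out = 50)
ll <- sapply(log.lambda.grid, loglike_emma,
             y2 = y2, Tk2 = Tk2, eigK.values = eigK$values)
plot(log.lambda.grid, ll, type = "l",
     xlab = "log(lambda)", ylab = "profile log-likelihood")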
Optimize \(\lambda\) using the optim function
Find the value of \(\lambda\) (log.lambda in
the code) that maximizes the profile likelihood using
the optim function.
parm <- optim(par = log(0.5), fn = loglike_emma,
y2 = y2, Tk2 = Tk2, eigK.values = eigK$values,
method="L-BFGS-B", lower = -100, upper = 100,
control = list(fnscale = -1))
parm
## $par
## [1] -7.656512
##
## $value
## [1] -674.0713
##
## $counts
## function gradient
## 5 5
##
## $convergence
## [1] 0
##
## $message
## [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
Conversion from \(\log(\lambda)\) to \(\lambda\)
Convert the optimized log.lambda value to \(\lambda\).
opt.lambda <- exp(parm$par)
opt.lambda
## [1] 0.0004729541
Estimation of regression coefficients based on the optimized \(\lambda\) value
Next, we obtain estimates of \(\beta\) and \(\gamma_k\) based on the optimized
\(\lambda\) value.
w <- 1 / (eigK$values * opt.lambda + 1)
TkHinvTk <- crossprod(Tk2 * sqrt(w))
TkHinvy <- crossprod(Tk2 * sqrt(w), y2 * sqrt(w))
Cinv <- solve(TkHinvTk)
theta <- Cinv %*% TkHinvy
theta
## [,1]
## 2.91895653
## Zk 0.01283801
Calculation of error variance and the estimation errors of the
parameters
Calculate the error variance.
yHinvy <- crossprod(y2 * sqrt(w))
s2 <- (yHinvy - t(TkHinvy) %*% Cinv %*% TkHinvy) / (n - ncol(X) - 1)
s2 <- drop(s2); s2
## [1] 0.02838054
Calculate the estimation error of the parameters.
v_theta <- Cinv * s2
v_theta
## Zk
## 0.1653170273 0.0007806255
## Zk 0.0007806255 0.0013392777
Wald test based on chi-square
Calculate the Wald statistic and perform a chi-square test.
Wk <- theta[2]^2 / v_theta[2,2]
Wk
## [1] 0.1230622
pk <- 1 - pchisq(Wk, df = 1)
pk
## [1] 0.7257367
-log10(pk)
## [1] 0.1392209
Wald test based on F
The \(F\) test can also be
performed.
pk <- 1 - pf(Wk, df1 = 1, df2 = n - ncol(X) - 1)
pk
## [1] 0.7259424
-log10(pk)
## [1] 0.1390978
GWAS with the association.test in gaston
Run a GWAS using the association.test function in the
gaston package, this time with a mixed linear model (method = "lmm").
gwas.res <- association.test(bm.wp, Y = y,
method = "lmm", eigenK = eigK,
response = "quantitative",
test = "wald", p = 0)
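Assuming again that the rows of gwas.res follow the marker order in bm.wp, the mixed-model result for the 1000th SNP (its beta, sd, and p columns, shown in the output below) can be compared with the hand calculation above:
gwas.res[k, c("beta", "sd", "p")] # compare with theta[2], sqrt(v_theta[2,2]), pk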
Heritability
Note that \[
\lambda = \phi^2 / \sigma^2
\] When we rewrite it with \(\phi^2 =
\sigma_g^2, \sigma^2 = \sigma_e^2\), then \[
\lambda = \sigma_g^2 / \sigma_e^2
\] Also, the heritability due to the polygene effect is, \[
h^2 = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2 }
\] then \[
h^2 = \frac{\lambda}{\lambda + 1}
\]
Comparison with the result of the association.test
function
The hand calculation agrees with the result of the
association.test function of the gaston package.
h2 <- opt.lambda / (opt.lambda + 1)
h2
## [1] 0.0004727305
Manhattan plot
Draw a Manhattan plot with the manhattan function in the
gaston package.
manhattan(gwas.res, pch = 20)

\(p\)-value adjustment (Bonferroni
correction)
Because we test a large number of SNPs, the \(p\)-values must be
adjusted to account for the false positives that arise in
multiple comparisons.
As a first approach, we apply the
Bonferroni correction to the \(p\)-values.
p.adj <- p.adjust(gwas.res$p, method = "bonferroni")
gwas.res[p.adj < 0.05, ]
## chr pos id A1 A2 freqA2 h2 beta
## 5401 2 2679995 SNP-2.2679992. C T 0.05263158 0.0007975582 0.3944404
## 11353 3 16688126 SNP-3.16686770. G A 0.31301939 0.0004212991 -0.1704106
## 11357 3 16719368 SNP-3.16718013. G A 0.25623269 0.0004045499 -0.2284444
## 11358 3 16733441 SNP-3.16732086. G T 0.37257618 0.0003592697 0.2736889
## 17088 5 5327913 SNP-5.5327890. A G 0.38504155 0.0003957726 0.1892406
## 17091 5 5362484 SNP-5.5362461. G A 0.21052632 0.0004161017 -0.2163359
## 17094 5 5371772 SNP-5.5371749. G A 0.47783934 0.0003193898 0.2266789
## 17095 5 5372955 SNP-5.5372932. A G 0.49030471 0.0003376325 0.1996932
## 17103 5 5442823 SNP-5.5442800. G A 0.24515235 0.0004024853 0.2053261
## 19176 5 28533147 SNP-5.28470501. A G 0.20914127 0.0004178987 0.2178054
## 19177 5 28548001 SNP-5.28485355. C T 0.07063712 0.0004071298 0.2844308
## 24574 7 22113010 SNP-7.22112016. A T 0.10941828 0.0003886924 0.2713847
## 24880 7 24845026 SNP-7.24844031. C T 0.09695291 0.0004107192 0.2572480
## 24902 7 25214651 SNP-7.25213656. C T 0.08587258 0.0003848025 0.3318363
## sd p
## 5401 0.07772526 3.878962e-07
## 11353 0.03019296 1.660961e-08
## 11357 0.03800305 1.841405e-09
## 11358 0.03012821 1.045226e-19
## 17088 0.03349821 1.611254e-08
## 17091 0.03632566 2.593449e-09
## 17094 0.02781480 3.651945e-16
## 17095 0.02713206 1.838486e-13
## 17103 0.04058167 4.201884e-07
## 19176 0.04461394 1.050138e-06
## 19177 0.05196176 4.403658e-08
## 24574 0.04749727 1.105522e-08
## 24880 0.05143631 5.694723e-07
## 24902 0.05847978 1.391951e-08
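Equivalently, Bonferroni control at the 0.05 level corresponds to a fixed per-marker threshold of \(0.05/m\), where \(m\) is the number of markers:
m <- nrow(gwas.res)
thr <- -log10(0.05 / m)       # genome-wide -log10(p) threshold
sum(-log10(gwas.res$p) > thr) # should equal the number of markers listed above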
False discovery rate (FDR)
Here, we control the false discovery rate (FDR) using the procedure
proposed by Benjamini and Hochberg (1995).
fdr <- p.adjust(gwas.res$p, method = "BH")
gwas.res[fdr < 0.05, ]
## chr pos id A1 A2 freqA2 h2 beta
## 5401 2 2679995 SNP-2.2679992. C T 0.05263158 0.0007975582 0.3944404
## 7305 2 18649299 SNP-2.18643429. C G 0.07617729 0.0004655045 0.2614543
## 9562 3 1596846 SNP-3.1595841. A G 0.35318560 0.0006248802 0.2393151
## 11353 3 16688126 SNP-3.16686770. G A 0.31301939 0.0004212991 -0.1704106
## 11357 3 16719368 SNP-3.16718013. G A 0.25623269 0.0004045499 -0.2284444
## 11358 3 16733441 SNP-3.16732086. G T 0.37257618 0.0003592697 0.2736889
## 11359 3 16737323 SNP-3.16735968. C T 0.18698061 0.0004467158 -0.1892784
## 17080 5 5249951 SNP-5.5249928. G A 0.40304709 0.0004171054 0.1620522
## 17081 5 5265914 SNP-5.5265891. C T 0.40443213 0.0004185274 0.1619151
## 17087 5 5321857 SNP-5.5321834. G A 0.12188366 0.0004518441 0.2497093
## 17088 5 5327913 SNP-5.5327890. A G 0.38504155 0.0003957726 0.1892406
## 17089 5 5338205 SNP-5.5338182. T C 0.27562327 0.0004219442 0.1838167
## 17091 5 5362484 SNP-5.5362461. G A 0.21052632 0.0004161017 -0.2163359
## 17094 5 5371772 SNP-5.5371749. G A 0.47783934 0.0003193898 0.2266789
## 17095 5 5372955 SNP-5.5372932. A G 0.49030471 0.0003376325 0.1996932
## 17096 5 5375849 SNP-5.5375826. C A 0.08310249 0.0004278811 0.2486248
## 17099 5 5388363 SNP-5.5388340. C T 0.11080332 0.0004072153 -0.1839561
## 17103 5 5442823 SNP-5.5442800. G A 0.24515235 0.0004024853 0.2053261
## 17104 5 5458379 SNP-5.5458356. A T 0.44182825 0.0004322142 0.1606296
## 17109 5 5516194 SNP-5.5516131. G A 0.64404432 0.0004397388 0.1418035
## 19176 5 28533147 SNP-5.28470501. A G 0.20914127 0.0004178987 0.2178054
## 19177 5 28548001 SNP-5.28485355. C T 0.07063712 0.0004071298 0.2844308
## 24569 7 22057564 SNP-7.22056570. A G 0.56786704 0.0004251118 -0.2137732
## 24574 7 22113010 SNP-7.22112016. A T 0.10941828 0.0003886924 0.2713847
## 24599 7 22252173 SNP-7.22251179. A T 0.08725762 0.0004128007 0.2432027
## 24624 7 22529639 SNP-7.22528645. C T 0.09141274 0.0004189755 0.2324821
## 24688 7 23048478 SNP-7.23047484. A G 0.11495845 0.0003836625 0.2146536
## 24697 7 23104178 SNP-7.23103184. C T 0.15096953 0.0004205732 0.2002018
## 24880 7 24845026 SNP-7.24844031. C T 0.09695291 0.0004107192 0.2572480
## 24902 7 25214651 SNP-7.25213656. C T 0.08587258 0.0003848025 0.3318363
## 27987 8 24258192 SNP-8.24255477. G A 0.12326870 0.0004405095 0.1882927
## sd p
## 5401 0.07772526 3.878962e-07
## 7305 0.05920973 1.006688e-05
## 9562 0.05168201 3.647290e-06
## 11353 0.03019296 1.660961e-08
## 11357 0.03800305 1.841405e-09
## 11358 0.03012821 1.045226e-19
## 11359 0.04093086 3.757854e-06
## 17080 0.03374972 1.574135e-06
## 17081 0.03355022 1.392665e-06
## 17087 0.05486661 5.333905e-06
## 17088 0.03349821 1.611254e-08
## 17089 0.04033835 5.192132e-06
## 17091 0.03632566 2.593449e-09
## 17094 0.02781480 3.651945e-16
## 17095 0.02713206 1.838486e-13
## 17096 0.05720543 1.385317e-05
## 17099 0.03988736 3.990238e-06
## 17103 0.04058167 4.201884e-07
## 17104 0.03539888 5.687025e-06
## 17109 0.03093504 4.563572e-06
## 19176 0.04461394 1.050138e-06
## 19177 0.05196176 4.403658e-08
## 24569 0.04594978 3.282149e-06
## 24574 0.04749727 1.105522e-08
## 24599 0.05341419 5.284902e-06
## 24624 0.05347250 1.375759e-05
## 24688 0.04450263 1.411381e-06
## 24697 0.04263682 2.659358e-06
## 24880 0.05143631 5.694723e-07
## 24902 0.05847978 1.391951e-08
## 27987 0.04310488 1.252485e-05
Manhattan plot with two thresholds
Visualize these results with Manhattan plots, coloring the markers that pass each criterion.
col <- rep("black", nrow(gwas.res))
col[gwas.res$chr %% 2 == 0] <- "gray50"
col[p.adj < 0.05] <- "green"
manhattan(gwas.res, pch = 20, col = col, main = "Bonferroni correction")

col <- rep("black", nrow(gwas.res))
col[gwas.res$chr %% 2 == 0] <- "gray50"
col[fdr < 0.05] <- "green"
manhattan(gwas.res, pch = 20, col = col, main = "False discovery rate")

Q-Q plot
Draw a Q-Q plot with the qqplot.pvalues function in
gaston.
qqplot.pvalues(gwas.res$p, pch = 20)
