## [1] "Document was last updated at 2023-03-15."
#Table of contents {.tabset .tabset-fade .tabset-pills}
##1 Theory
Background
Several studies have used summary statistics for the inference of genetic architecture
Theory
The theory of the paper can be found in Eur J Hum Genet, 2017, 25:137-46. For a pair of cohort their correlation of their z scores can be defined
\(\rho=cor(\textbf{z}_1,\textbf{z}_2)=\frac{\frac{\sqrt{h^2_1h^2_2}}{m}\gamma_G+\frac{\gamma_{o}}{\sqrt{n_1n_2}}\gamma_P}{\sqrt{\frac{h^2_1}{m}+\frac{1}{n_1}}\sqrt{\frac{h^2_2}{m}+\frac{1}{n_2}}}\)
| Notation | Interpretation |
|---|---|
| \(\textbf{z}\) | z scores, a vector of \(M\times1\), for SNPs |
| \(h^2_{.}\) | heritbility for a trait |
| \(m\) | SNPs |
| \(n_{.}\) | Sample size |
| \(\gamma_G\) | Genetic correlation for a pair of traits |
| \(\gamma_P\) | Phenotypic correlation for a pair of traits |
| \(r_{o}=\frac{n_o}{\sqrt{n_1n_2}}\) | Overlapping samples between a pair of cohorts |
##2 Experiment design
| Method | ||||||
|---|---|---|---|---|---|---|
| Design | \(cor(\beta_1,\beta_2)\) | \(\lambda_{meta}\) | PPSR | |||
| No-consortia | \(h^2=0\), \(E(n_e)=n_e\) | decouple, GLM, \(\chi^2\) | \(h^2=0\), \(E(n_e)=n_e\) | decouple, GLM, \(\chi^2\) | NA | NA |
| \(h^2!=0\), \(E(n_e)!=n_e\) | NA | \(h^2!=0\), \(E(n_e)=n_e\) | decouple, GLM, \(\chi^2\) | NA | NA | |
| consortia | \(h^2=0\), \(E(n_e)=n_e\) | decouple, GLM, \(\chi^2\) | \(h^2=0\), \(E(n_e)=n_e\) | decouple, GLM, \(\chi^2\) | \(n_{e,0}\),\(n_{e,1}\) | decouple, GLM, \(\chi^2\), deep clean |
| \(h^2!=0\), \(E(n_e)!=n_e\) | NA | \(h^2!=0\), \(E(n_e)=n_e\) | decouple, GLM, \(\chi^2\) | \(n_{e,0}\),\(n_{e,1}\) | decouple, GLM, \(\chi^2\), deep clean |
##3 Simulating a cohort
Parameter setting (no LD):
The whole genotype matrix is \(\mathbf{G}_{n\times m}\). For individual \(i\) its \(j^{th}\) locus \(\mathbf{G}_{ij}\) is sampled from binomial distribution \(\mathbf{B}(p_j,2)\), in which \(p_j\) is the allele frequency of the locus.
\(\mathbf{y}=\mu+\mathbf{G}\beta+e\)
in which \(e \sim N(0, \frac{\sigma^2_{(\mathbf{G\beta})}}{h^2}(1-h^2))\) so that the genotypic variation \(V_G=Var(\mathbf{G}\beta)\) can be scaled to satify heritability.
print("Write R code for simulating a single cohort")## [1] "Write R code for simulating a single cohort"
Estimate genetic effects
\(\mathbf{y}=a+\beta_j\mathbf{g}_j+e\)
print("Write R code for estimating single-marker gwas")## [1] "Write R code for estimating single-marker gwas"
After GWAS, we have a table for GWAS summary statistics
| SNP | Ref Allele | Alt Allele | Freq | Effect | SE |
|---|---|---|---|---|---|
| \(SNP_1\) | G | A | \(\hat{p}_1\) | \(\hat\beta_1\) | \(\hat{\sigma}_{\beta_1}\) |
| \(SNP_2\) | T | C | \(\hat{p}_2\) | \(\hat\beta_2\) | \(\hat{\sigma}_{\beta_2}\) |
| \(\mathbf{SNP}_3\) | \(\hat{p}_3\) | \(\hat\beta_3\) | \(\hat{\sigma}_{\beta_3}\) | ||
| \(\mathbf{SNP}_4\) | \(\hat{p}_4\) | \(\hat\beta_4\) | \(\hat{\sigma}_{\beta_4}\) | ||
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
| \(SNP_m\) | C | T | \(\hat{p}_m\) | \(\hat\beta_m\) | \(\hat{\sigma}_{\beta_m}\) |
##4 Simulating a pair of cohorts
In general, the procedure is the same as for simulating a single cohort.
However, for cohort 2 that is derived from the same ancestry as cohort 1, the expected allele frequency \(p_{j,2}=p_{j,1}\).
Given the genetic correlation between cohort 1 and 2 is \(cor(\mathbf{\beta}_{1}, \mathbf{\beta}_{2})=\gamma_G\), the genetic effect for cohort 2 can be simulated as \(\beta_2=\gamma_G\beta_1+\sqrt{1-r^2}\tilde{e}\), in which \(\tilde{e} \sim (0, h^2_2)\).
Directly simulating summary statistics for GWAS
Assuming the genetic effect is estimated via LSE for the model below
\(\mathbf{y}=a+\beta_i \mathbf{g}_i+e\)
For locus \(j\), its estimated effect is distributed \(\hat\beta_i \sim N(\beta_i, \sqrt{\frac{\sigma_y^2-\sigma_{g_i}^2 \beta^2_i}{(n_1-n_u)\sigma_{g_i}^2}})\), in which \(E(\sigma_{g_i}^2)=2p_i \bar p_i\), and \(\hat{\sigma}_{\beta}^2=\frac{\sigma_y^2-\sigma_{g_i}^2 \beta^2_i}{(n_1-n_u)\sigma_{g_i}^2}\), \(n_u\) (\(n_c=2\) here for the model above) is the number of parameters in the LSE.
For a pair of cohorts we have summary statistics below
| SNP | Ref Allele\(_1\) | Alt Allele\(_1\) | Freq\(_1\) | Effect\(_1\) | SE\(_1\) | Ref Allele\(_2\) | Alt Allele\(_2\) | Effect\(_2\) | Freq\(_2\) | SE\(_2\) |
|---|---|---|---|---|---|---|---|---|---|---|
| \(SNP_{1}\) | A | C | \(\hat{p}_{1.1}\) | \(\hat\beta_{1.1}\) | \(\hat{\sigma}_{\beta_{1.1}}\) | A | C | \(\hat{p}_{1.1}\) | \(\hat\beta_{1.2}\) | \(\hat{\sigma}_{\beta_{1.2}}\) |
| \(\mathbf{SNP}_{2}\) | \(\hat{p}_{2.1}\) | \(\hat\beta_{2.1}\) | \(\hat{\sigma}_{\beta_{2.1}}\) | \(\hat{p}_{2.1}\) | \(\hat\beta_{2.2}\) | \(\hat{\sigma}_{\beta_{2.2}}\) | ||||
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
| \(SNP_{m}\) | A | C | \(\hat{p}_{m.1}\) | \(\hat\beta_{m.1}\) | \(\hat{\sigma}_{\beta_{m.1}}\) | A | C | \(\hat{p}_{m.1}\) | \(\hat\beta_{m.2}\) | \(\hat{\sigma}_{\beta_{m.2}}\) |
##5 Simulating nuclear families
M=100
Kid=2
freq=runif(M, 0.05, 0.95)
hsq=0.5
b=rnorm(M, 0, sqrt(3*hsq/M))
famSize=100
FamG=array(0, dim=c(famSize, (2+Kid)*2, M))
Phe=array(0, (2+Kid)*famSize)
for(f in 1:famSize) {
for(i in 1:famSize) {
fg=matrix(c(rbinom(M, 1, freq), rbinom(M, 1, freq)), 2, M, byrow = T)
mg=matrix(c(rbinom(M, 1, freq), rbinom(M, 1, freq)), 2, M, byrow = T)
Phe[(f-1)*(2+Kid)+1]= sum(fg[1,]*b+fg[2,]*b)
Phe[(f-1)*(2+Kid)+2]= sum(mg[1,]*b+mg[2,]*b)
kg=matrix(0, Kid*2, M)
for(k in 1:Kid) {
kg[(k-1)*2+1,] = fg[rbinom(1, 1, 0.5)+1,]
kg[(k-1)*2+2,] = fg[rbinom(1, 1, 0.5)+1,]
Phe[(f-1)*(2+Kid)+2+k] = sum(kg[(k-1)*2+1,]*b+kg[(k-1)*2+2,]*b)
}
FG=rbind(fg, mg, kg)
FamG[f,,]=FG
}
}
plot(freq, colMeans(colMeans(FamG[,,])), xlab="Exp freq", ylab="Obs freq", bty='n')
abline(a=0, b=1)print(var(Phe))## [1] 0.6681338
Phe=Phe+rnorm(length(Phe), 0, sqrt(0.5))
print(var(Phe))## [1] 1.101563
##6 Effective overlapping samples \(n_e\)
SimuInd <- function(s, m, h2, fq, B) {
hsq=h2
Gn=matrix(0, s, m+1)
for(i in 1:s) {
Gn[i, 1:m] = rbinom(m, 2, fq)
Gn[i, m+1] = sum(Gn[i,1:m]*B) + rnorm(1, 0, sqrt(1-h2))
}
return(Gn)
}
SimuNuFam <- function (n1, m, k, h2, fq, B) {
famSize=n1
M=m
Kid=k
hsq=h2
freq=fq
b=B
# FamG=array(0, dim=c(famSize, (2+Kid)*2, M))
FamG=array(0, dim=c(famSize, 2+Kid, M+1))
Phe=array(0, (2+Kid)*famSize)
for(f in 1:famSize) {
fg=matrix(c(rbinom(M, 1, freq), rbinom(M, 1, freq)), 2, M, byrow = T)
mg=matrix(c(rbinom(M, 1, freq), rbinom(M, 1, freq)), 2, M, byrow = T)
Phe[(f-1)*(2+Kid)+1]= sum(fg[1,1:M]*b+fg[2,1:M]*b) + rnorm(1, 0, sqrt(1-h2))
Phe[(f-1)*(2+Kid)+2]= sum(mg[1,1:M]*b+mg[2,1:M]*b) + rnorm(1, 0, sqrt(1-h2))
FamG[f,1,1:M]=apply(fg, 2, sum)
FamG[f,2,1:M]=apply(mg, 2, sum)
kg=matrix(0, Kid*2, M)
for(k in 1:Kid) {
kg[(k-1)*2+1,] = fg[rbinom(1, 1, 0.5)+1,]
kg[(k-1)*2+2,] = fg[rbinom(1, 1, 0.5)+1,]
Phe[(f-1)*(2+Kid)+2+k] = sum(kg[(k-1)*2+1,1:M]*b+kg[(k-1)*2+2,1:M]*b) + rnorm(1, 0, sqrt(1-h2))
FamG[f,2+k,1:M]=apply(kg[c((k-1)*2+1,(k-1)*2+2),], 2, sum)
}
}
FamG[,,M+1] = Phe
return(FamG)
}
h2=0.5
M=10000
freq=runif(M, 0.05, 0.95)
b=rnorm(M, 0, sqrt(3*h2/M))
S1=1000 #sample size 1
S2=1000 #sample size 2
n0=100 #overlapping samples
n1=100 #overlapping first-degree relatives
pop1=SimuInd(S1-n0-n1, M, h2, freq, b) #independent sample
pop2=SimuInd(S2-n0-n1, M, h2, freq, b) #independent sample
Gn0=SimuInd(n0, M, h2, freq, b) #overlapping samples
Gn1=SimuNuFam(n1, 10000, 2, h2, freq, b) #nuclear family
sample1=rbind(pop1, Gn0, Gn1[,3,])
sample2=rbind(pop2, Gn0, Gn1[,4,])
SS1=matrix(0, nrow = M, 4)
for(i in 1:M) {
mod=lm(sample1[,M+1]~sample1[,i])
SS1[i,]=summary(mod)$coefficient[2,]
}
SS2=matrix(0, nrow = M, 4)
for(i in 1:M) {
mod=lm(sample2[,M+1]~sample2[,i])
SS2[i,]=summary(mod)$coefficient[2,]
}
plot(SS1[,1], SS2[,1], bty="n", pch=16)##7 \(\lambda_{meta}\)
Given the summary statistics below, we have
| SNP | Ref Allele\(_1\) | Alt Allele\(_1\) | Freq\(_1\) | Effect\(_1\) | SE\(_1\) | Ref Allele\(_2\) | Alt Allele\(_2\) | Effect\(_2\) | Freq\(_2\) | SE\(_2\) |
|---|---|---|---|---|---|---|---|---|---|---|
| \(SNP_{1}\) | A | C | \(\hat{p}_{1.1}\) | \(\hat\beta_{1.1}\) | \(\hat{\sigma}_{\beta_{1.1}}\) | A | C | \(\hat{p}_{1.1}\) | \(\hat\beta_{1.2}\) | \(\hat{\sigma}_{\beta_{1.2}}\) |
| \(\mathbf{SNP}_{2}\) | \(\hat{p}_{2.1}\) | \(\hat\beta_{2.1}\) | \(\hat{\sigma}_{\beta_{2.1}}\) | \(\hat{p}_{2.1}\) | \(\hat\beta_{2.2}\) | \(\hat{\sigma}_{\beta_{2.2}}\) | ||||
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
| \(SNP_{m}\) | A | C | \(\hat{p}_{m.1}\) | \(\hat\beta_{m.1}\) | \(\hat{\sigma}_{\beta_{m.1}}\) | A | C | \(\hat{p}_{m.1}\) | \(\hat\beta_{m.2}\) | \(\hat{\sigma}_{\beta_{m.2}}\) |
We can estimate \(\lambda_{meta}\) statistic as below
\(h_i= \frac{(\beta_{j,1}-\beta_{j,2})^2} {\sigma_{j,1}^2+\sigma_{j,2}^2} \sim \chi^2_1\)
For \(m\) loci, we have \(\mathbf{h}\), a \(m\times 1\) vector. As the vector \(\mathbf{h} \sim \chi^2_1\), we can define \(\lambda_{meta}=\frac{median(\mathbf{h})}{0.455}\), in which 0.455 is the quantity for \(\chi^2_{1,p=0.5}=0.455\). Under the null hypothesis of no overlapping samples, \(E(\lambda_{meta})=1\).
\(\hat{\rho}_{\lambda_{meta}}=1-\lambda_{meta}\), and \(E(\rho_{\lambda_{meta}})=\frac{n_o}{\sqrt{n_1n_2}}\).
##8 Various overlapping relatives
Overlapping individuals
Overlapping relatives
##9 Meta-analysis
For a pair of cohorts, for the\(j^{th}\) locus, we have \(\hat\beta_{j,1}\), \(\hat{\sigma}^2_{\beta_{j,1}}\), and \(\hat\beta_{j,1}\), \(\hat{\sigma}^2_{\beta_{j,1}}\).
Using generalized (or weighted) least squares (GLS) estimates the genetic effect
\(\mathbf{y}=\mu+e\)
in which \(\mathbf{y}^T=[\hat{\beta}_{j,1}, \hat{\beta}_{j,2}]\), and \[ e= \left[ \begin{matrix} 1&\rho_{1,2} \\ \rho_{1,2}&1 \end{matrix} \right] \]
Correct and incorrect ways in estimating
Correct but subject to heterogeneity: \(\hat\rho=1-\lambda_{meta}\)
Incorrect and subject to genetic architecture: \(\hat\rho=cov(\mathbf{Z}_1, \mathbf{Z}_2)\) or \(\hat\rho=cov(\mathbf{\beta}_1, \mathbf{\beta}_2)\)
More details for the estimation method can be [1 & 2 are generalized lse, 3 & 4 are chisq test]
Am J Hum Genet, “Meta-analysis of genome-wide association studies with overlapping subjects.”, 2009:85, 862-72, Eq 3-5
Bioinformatics, “METAL: fast and efficient meta-analysis of genomewide association scans”, 2010:26, 2190-1, See their Table
set.seed(1000)
CT=4
N=ceiling(rnorm(CT, 3000, 1000))
M=10000
#freq
P=runif(M, 0.05, 0.95)
Freq=matrix(0, M, CT)
for(i in 1:CT) {
Freq[,i] = rnorm(M, P, sd=sqrt((P*(1-P))/(2*N[i])))
}
#mat
h2=0.5
Beta=rnorm(M, 0, sd=0)#=sqrt(h2/sum(2*P*(1-P))))
Bmat=matrix(0, M, CT)
Smat=matrix(0, M, CT)
Pmat=matrix(0, M, CT)
for(i in 1:CT) {
Smat[,i]=1/(N[i]*2*Freq[,i]*(1-Freq[,i]))
Bmat[,i] = rnorm(M, Beta, sd=sqrt(Smat[,i]))
Pmat[,i] = (1-pnorm(abs(Bmat[,i]/sqrt(Smat[,i]))))*2
}
#fixed model
FisherP=matrix(0, M, 1)
Fmeta=matrix(0, M, 3)
Qmeta=matrix(0, M, 6)
for(i in 1:M) {
FisherP[i,1]=-2*sum(log(Pmat[i,]))
v=1/Smat[i,]
Fmeta[i,2]=1/sum(v)
Fmeta[i,1]=sum(Bmat[i,]*v)/sum(v)
Fmeta[i,3]=Fmeta[i,1]/sqrt(Fmeta[i,2])
Qmeta[i,4]=sum(v*(Bmat[i,]-Fmeta[i,1])^2)
Qmeta[i,5]=sum(v)-sum(v^2)/sum(v)
Qmeta[i,6]=(Qmeta[i,4]-(CT-1))/Qmeta[i,5]
if(Qmeta[i,6]<0) {
Qmeta[i,6]=0
}
vr=1/(Smat[i,]+Qmeta[i,6])
Qmeta[i,2]=1/sum(v)+Qmeta[i,6]
Qmeta[i,1]=sum(vr*Bmat[i,])/sum(vr)
Qmeta[i,3]=Qmeta[i,1]^2/Qmeta[i,2]
}
layout(matrix(1:4, 2, 2))
qqplot(main="Fisher's methods", xlab=expression(paste("Theoretical ", chi[8]^2)),
ylab=expression(paste("Observed ", chi[8]^2)),
FisherP, rchisq(M, 8), bty="n", pch=16)
abline(a=0, b=1)
qqplot(main="Fixed model", xlab=expression(paste("Theoretical ", chi[1]^2)),
ylab=expression(paste("Observed ", chi[1]^2)),
rchisq(M, 1), Fmeta[,3]^2, bty='n', pch=16)
abline(a=0, b=1)
qqplot(main="Random model", xlab=expression(paste("Theoretical ", chi[1]^2)),
ylab=expression(paste("Observed ", chi[1]^2)),
rchisq(M,1), Qmeta[,1]^2/Qmeta[,2], bty='n', pch=16)
abline(a=0, b=1)
qqplot(main="Heterogeneity (4 cohorts)", xlab="Q", ylab=expression(paste("Theoretical ", chi[3]^2)), Qmeta[,4], rchisq(M,CT-1), bty='n', pch=16)
abline(a=0, b=1)##10 MLM GWAMA Mixed model for GWAMA
\(\beta=\mu+Mb+e\)
\(\beta\sim N(\mu, \sigma_M+\sigma_e)\)
##11 Notes for LD simulation
A pair of loci
| Notations | Interpretation |
|---|---|
| \(p_A, p_a=\bar p_A\) | Allele frequency for \(A\) and \(a\) at locus A |
| \(p_B, p_b=\bar p_B\) | Allele frequency for \(B\) and \(b\) at locus B |
| \(p_{AB}\) | Haplotype frequency for \(AB\) |
| \(D_{AB}\) | Linkage disequilibrium \(D_{AB}=p_{AB}-p_Ap_B\); the maximal \(D_{AB}=0.25\) is reached when \(p_A=p_B=0.5\) |
| \({D}_{AB}^\prime\) | Standardized \(D_{AB}\); if \(D_{AB} > 0\), \({D}_{AB}^\prime=\frac{{D}_{AB}}{min(p_ap_B,p_Ap_b)}\), if \(D_{AB} < 0\), \({D}_{AB}^\prime=\frac{{D}_{AB}}{min(p_ap_b,p_Ap_B)}\) |
Haplotype table
| Locus \(\textbf{A}\) | ||||
|---|---|---|---|---|
| \(A\) | \(a\) | |||
| Locus \(\textbf{B}\) | \(B\) | \(AB: p_Ap_B+D\) | \(aB: p_ap_B-D\) | \(p_B\) |
| \(b\) | \(Ab: p_Ap_b-D\) | \(ab: p_ap_b+D\) | \(p_b\) | |
| \(p_A\) | \(p_a\) | |||
In haplotype simulation, under random mating, the conditional probability of generating the second locus could be expressed as (t is generation)
\(D_{AB}^t=D_{AB}(1-c_{AB})^t\), in which \(c_{AB}\) is the recombination between loci A and B.
print("Write R code for ld simulation")## [1] "Write R code for ld simulation"
Reference
Devlin B., Risch N., 1995 A comparison of linkage disequilibrium measures for fine-scale mapping. Genomics 29: 311-22.
Lewontin R. C., 1964 The Interaction of Selection and Linkage. I. General Considerations; Heterotic Models. Genetics 49: 49-67.
##12 Super GWAMA-design The seminal idea of the design has been sketched in \(G3, 2017, 7:943-52\).