SumStat

## [1] "Document was last updated at 2023-03-15."

#Table of contents {.tabset .tabset-fade .tabset-pills}

##1 Theory

Background

Several studies have used summary statistics for the inference of genetic architecture

Theory

The theory of the paper can be found in Eur J Hum Genet, 2017, 25:137-46. For a pair of cohort their correlation of their z scores can be defined

\(\rho=cor(\textbf{z}_1,\textbf{z}_2)=\frac{\frac{\sqrt{h^2_1h^2_2}}{m}\gamma_G+\frac{\gamma_{o}}{\sqrt{n_1n_2}}\gamma_P}{\sqrt{\frac{h^2_1}{m}+\frac{1}{n_1}}\sqrt{\frac{h^2_2}{m}+\frac{1}{n_2}}}\)

Notation	Interpretation
\(\textbf{z}\)	z scores, a vector of \(M\times1\), for SNPs
\(h^2_{.}\)	heritbility for a trait
\(m\)	SNPs
\(n_{.}\)	Sample size
\(\gamma_G\)	Genetic correlation for a pair of traits
\(\gamma_P\)	Phenotypic correlation for a pair of traits
\(r_{o}=\frac{n_o}{\sqrt{n_1n_2}}\)	Overlapping samples between a pair of cohorts

##2 Experiment design

				Method
Design	\(cor(\beta_1,\beta_2)\)		\(\lambda_{meta}\)		PPSR
No-consortia	\(h^2=0\), \(E(n_e)=n_e\)	decouple, GLM, \(\chi^2\)	\(h^2=0\), \(E(n_e)=n_e\)	decouple, GLM, \(\chi^2\)	NA	NA
	\(h^2!=0\), \(E(n_e)!=n_e\)	NA	\(h^2!=0\), \(E(n_e)=n_e\)	decouple, GLM, \(\chi^2\)	NA	NA
consortia	\(h^2=0\), \(E(n_e)=n_e\)	decouple, GLM, \(\chi^2\)	\(h^2=0\), \(E(n_e)=n_e\)	decouple, GLM, \(\chi^2\)	\(n_{e,0}\),\(n_{e,1}\)	decouple, GLM, \(\chi^2\), deep clean
	\(h^2!=0\), \(E(n_e)!=n_e\)	NA	\(h^2!=0\), \(E(n_e)=n_e\)	decouple, GLM, \(\chi^2\)	\(n_{e,0}\),\(n_{e,1}\)	decouple, GLM, \(\chi^2\), deep clean

##3 Simulating a cohort

Parameter setting (no LD):

Sample size, \(n=\{1000, 2000\}\)
number of markers, \(m = \{1,000, 10,000\}\)
number of QTLs, \(m_{Q} = \{100, 500, 1,000\} (<=m)\)
For \(m\) loci, its allele frequency is from uniform distribution \(p \sim U(0.05, 0.95)\)
Genetic effect is sampled from \(\mathbf{\beta}_i \sim N(0, \frac{h^2}{m_Q})\)
heritability, \(h^2=\{0.1, 0.2, 0.5\}\)

The whole genotype matrix is \(\mathbf{G}_{n\times m}\). For individual \(i\) its \(j^{th}\) locus \(\mathbf{G}_{ij}\) is sampled from binomial distribution \(\mathbf{B}(p_j,2)\), in which \(p_j\) is the allele frequency of the locus.

\(\mathbf{y}=\mu+\mathbf{G}\beta+e\)

in which \(e \sim N(0, \frac{\sigma^2_{(\mathbf{G\beta})}}{h^2}(1-h^2))\) so that the genotypic variation \(V_G=Var(\mathbf{G}\beta)\) can be scaled to satify heritability.

print("Write R code for simulating a single cohort")

## [1] "Write R code for simulating a single cohort"

Estimate genetic effects

\(\mathbf{y}=a+\beta_j\mathbf{g}_j+e\)

print("Write R code for estimating single-marker gwas")

## [1] "Write R code for estimating single-marker gwas"

After GWAS, we have a table for GWAS summary statistics

SNP	Ref Allele	Alt Allele	Freq	Effect	SE
\(SNP_1\)	G	A	\(\hat{p}_1\)	\(\hat\beta_1\)	\(\hat{\sigma}_{\beta_1}\)
\(SNP_2\)	T	C	\(\hat{p}_2\)	\(\hat\beta_2\)	\(\hat{\sigma}_{\beta_2}\)
\(\mathbf{SNP}_3\)	T	A	\(\hat{p}_3\)	\(\hat\beta_3\)	\(\hat{\sigma}_{\beta_3}\)
\(\mathbf{SNP}_4\)	G	C	\(\hat{p}_4\)	\(\hat\beta_4\)	\(\hat{\sigma}_{\beta_4}\)
\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)
\(SNP_m\)	C	T	\(\hat{p}_m\)	\(\hat\beta_m\)	\(\hat{\sigma}_{\beta_m}\)

##4 Simulating a pair of cohorts

In general, the procedure is the same as for simulating a single cohort.

However, for cohort 2 that is derived from the same ancestry as cohort 1, the expected allele frequency \(p_{j,2}=p_{j,1}\).

Given the genetic correlation between cohort 1 and 2 is \(cor(\mathbf{\beta}_{1}, \mathbf{\beta}_{2})=\gamma_G\), the genetic effect for cohort 2 can be simulated as \(\beta_2=\gamma_G\beta_1+\sqrt{1-r^2}\tilde{e}\), in which \(\tilde{e} \sim (0, h^2_2)\).

Directly simulating summary statistics for GWAS

Assuming the genetic effect is estimated via LSE for the model below

\(\mathbf{y}=a+\beta_i \mathbf{g}_i+e\)

For locus \(j\), its estimated effect is distributed \(\hat\beta_i \sim N(\beta_i, \sqrt{\frac{\sigma_y^2-\sigma_{g_i}^2 \beta^2_i}{(n_1-n_u)\sigma_{g_i}^2}})\), in which \(E(\sigma_{g_i}^2)=2p_i \bar p_i\), and \(\hat{\sigma}_{\beta}^2=\frac{\sigma_y^2-\sigma_{g_i}^2 \beta^2_i}{(n_1-n_u)\sigma_{g_i}^2}\), \(n_u\) (\(n_c=2\) here for the model above) is the number of parameters in the LSE.

For a pair of cohorts we have summary statistics below

SNP	Ref Allele\(_1\)	Alt Allele\(_1\)	Freq\(_1\)	Effect\(_1\)	SE\(_1\)	Ref Allele\(_2\)	Alt Allele\(_2\)	Effect\(_2\)	Freq\(_2\)	SE\(_2\)
\(SNP_{1}\)	A	C	\(\hat{p}_{1.1}\)	\(\hat\beta_{1.1}\)	\(\hat{\sigma}_{\beta_{1.1}}\)	A	C	\(\hat{p}_{1.1}\)	\(\hat\beta_{1.2}\)	\(\hat{\sigma}_{\beta_{1.2}}\)
\(\mathbf{SNP}_{2}\)	G	C	\(\hat{p}_{2.1}\)	\(\hat\beta_{2.1}\)	\(\hat{\sigma}_{\beta_{2.1}}\)	G	C	\(\hat{p}_{2.1}\)	\(\hat\beta_{2.2}\)	\(\hat{\sigma}_{\beta_{2.2}}\)
\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)
\(SNP_{m}\)	A	C	\(\hat{p}_{m.1}\)	\(\hat\beta_{m.1}\)	\(\hat{\sigma}_{\beta_{m.1}}\)	A	C	\(\hat{p}_{m.1}\)	\(\hat\beta_{m.2}\)	\(\hat{\sigma}_{\beta_{m.2}}\)

##5 Simulating nuclear families

M=100
Kid=2
freq=runif(M, 0.05, 0.95)
hsq=0.5
b=rnorm(M, 0, sqrt(3*hsq/M))
famSize=100
FamG=array(0, dim=c(famSize, (2+Kid)*2, M))
Phe=array(0, (2+Kid)*famSize)
for(f in 1:famSize) {
  for(i in 1:famSize) {
    fg=matrix(c(rbinom(M, 1, freq), rbinom(M, 1, freq)), 2, M, byrow = T)
    mg=matrix(c(rbinom(M, 1, freq), rbinom(M, 1, freq)), 2, M, byrow = T)
    Phe[(f-1)*(2+Kid)+1]= sum(fg[1,]*b+fg[2,]*b)
    Phe[(f-1)*(2+Kid)+2]= sum(mg[1,]*b+mg[2,]*b)

    kg=matrix(0, Kid*2, M)
    for(k in 1:Kid) {
      kg[(k-1)*2+1,] = fg[rbinom(1, 1, 0.5)+1,]
      kg[(k-1)*2+2,] = fg[rbinom(1, 1, 0.5)+1,]
      Phe[(f-1)*(2+Kid)+2+k] = sum(kg[(k-1)*2+1,]*b+kg[(k-1)*2+2,]*b)
    }
    FG=rbind(fg, mg, kg)
    FamG[f,,]=FG
  }
}
plot(freq, colMeans(colMeans(FamG[,,])), xlab="Exp freq", ylab="Obs freq", bty='n')
abline(a=0, b=1)

print(var(Phe))

## [1] 0.6681338

Phe=Phe+rnorm(length(Phe), 0, sqrt(0.5))
print(var(Phe))

## [1] 1.101563

##6 Effective overlapping samples \(n_e\)

SimuInd <- function(s, m, h2, fq, B) {
  hsq=h2
  
  Gn=matrix(0, s, m+1)
  for(i in 1:s) {
    Gn[i, 1:m] = rbinom(m, 2, fq)
    Gn[i, m+1] = sum(Gn[i,1:m]*B) + rnorm(1, 0, sqrt(1-h2))
  }
  return(Gn)
}

SimuNuFam <- function (n1, m, k, h2, fq, B) {
  
  famSize=n1
  M=m
  Kid=k
  hsq=h2
  
  freq=fq
  b=B
  #  FamG=array(0, dim=c(famSize, (2+Kid)*2, M))
  FamG=array(0, dim=c(famSize, 2+Kid, M+1))
  
  Phe=array(0, (2+Kid)*famSize)
  for(f in 1:famSize) {
    fg=matrix(c(rbinom(M, 1, freq), rbinom(M, 1, freq)), 2, M, byrow = T)
    mg=matrix(c(rbinom(M, 1, freq), rbinom(M, 1, freq)), 2, M, byrow = T)
    Phe[(f-1)*(2+Kid)+1]= sum(fg[1,1:M]*b+fg[2,1:M]*b) + rnorm(1, 0, sqrt(1-h2))
    Phe[(f-1)*(2+Kid)+2]= sum(mg[1,1:M]*b+mg[2,1:M]*b) + rnorm(1, 0, sqrt(1-h2))
    FamG[f,1,1:M]=apply(fg, 2, sum)
    FamG[f,2,1:M]=apply(mg, 2, sum)
    
    kg=matrix(0, Kid*2, M)
    for(k in 1:Kid) {
      kg[(k-1)*2+1,] = fg[rbinom(1, 1, 0.5)+1,]
      kg[(k-1)*2+2,] = fg[rbinom(1, 1, 0.5)+1,]
      Phe[(f-1)*(2+Kid)+2+k] = sum(kg[(k-1)*2+1,1:M]*b+kg[(k-1)*2+2,1:M]*b) + rnorm(1, 0, sqrt(1-h2))
      FamG[f,2+k,1:M]=apply(kg[c((k-1)*2+1,(k-1)*2+2),], 2, sum)
    }
  }
  FamG[,,M+1] = Phe
  return(FamG)
}


h2=0.5
M=10000
freq=runif(M, 0.05, 0.95)
b=rnorm(M, 0, sqrt(3*h2/M))

S1=1000 #sample size 1
S2=1000 #sample size 2

n0=100 #overlapping samples
n1=100 #overlapping first-degree relatives

pop1=SimuInd(S1-n0-n1, M, h2, freq, b) #independent sample
pop2=SimuInd(S2-n0-n1, M, h2, freq, b) #independent sample
Gn0=SimuInd(n0, M, h2, freq, b) #overlapping samples
Gn1=SimuNuFam(n1, 10000, 2, h2, freq, b) #nuclear family

sample1=rbind(pop1, Gn0, Gn1[,3,])
sample2=rbind(pop2, Gn0, Gn1[,4,])

SS1=matrix(0, nrow = M, 4)
for(i in 1:M) {
  mod=lm(sample1[,M+1]~sample1[,i])
  SS1[i,]=summary(mod)$coefficient[2,]
}

SS2=matrix(0, nrow = M, 4)
for(i in 1:M) {
  mod=lm(sample2[,M+1]~sample2[,i])
  SS2[i,]=summary(mod)$coefficient[2,]
}
plot(SS1[,1], SS2[,1], bty="n", pch=16)

##7 \(\lambda_{meta}\)

Given the summary statistics below, we have

SNP	Ref Allele\(_1\)	Alt Allele\(_1\)	Freq\(_1\)	Effect\(_1\)	SE\(_1\)	Ref Allele\(_2\)	Alt Allele\(_2\)	Effect\(_2\)	Freq\(_2\)	SE\(_2\)
\(SNP_{1}\)	A	C	\(\hat{p}_{1.1}\)	\(\hat\beta_{1.1}\)	\(\hat{\sigma}_{\beta_{1.1}}\)	A	C	\(\hat{p}_{1.1}\)	\(\hat\beta_{1.2}\)	\(\hat{\sigma}_{\beta_{1.2}}\)
\(\mathbf{SNP}_{2}\)	G	C	\(\hat{p}_{2.1}\)	\(\hat\beta_{2.1}\)	\(\hat{\sigma}_{\beta_{2.1}}\)	G	C	\(\hat{p}_{2.1}\)	\(\hat\beta_{2.2}\)	\(\hat{\sigma}_{\beta_{2.2}}\)
\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)
\(SNP_{m}\)	A	C	\(\hat{p}_{m.1}\)	\(\hat\beta_{m.1}\)	\(\hat{\sigma}_{\beta_{m.1}}\)	A	C	\(\hat{p}_{m.1}\)	\(\hat\beta_{m.2}\)	\(\hat{\sigma}_{\beta_{m.2}}\)

We can estimate \(\lambda_{meta}\) statistic as below

\(h_i= \frac{(\beta_{j,1}-\beta_{j,2})^2} {\sigma_{j,1}^2+\sigma_{j,2}^2} \sim \chi^2_1\)

For \(m\) loci, we have \(\mathbf{h}\), a \(m\times 1\) vector. As the vector \(\mathbf{h} \sim \chi^2_1\), we can define \(\lambda_{meta}=\frac{median(\mathbf{h})}{0.455}\), in which 0.455 is the quantity for \(\chi^2_{1,p=0.5}=0.455\). Under the null hypothesis of no overlapping samples, \(E(\lambda_{meta})=1\).

\(\hat{\rho}_{\lambda_{meta}}=1-\lambda_{meta}\), and \(E(\rho_{\lambda_{meta}})=\frac{n_o}{\sqrt{n_1n_2}}\).

##8 Various overlapping relatives

Overlapping individuals
Overlapping relatives

##9 Meta-analysis

For a pair of cohorts, for the\(j^{th}\) locus, we have \(\hat\beta_{j,1}\), \(\hat{\sigma}^2_{\beta_{j,1}}\), and \(\hat\beta_{j,1}\), \(\hat{\sigma}^2_{\beta_{j,1}}\).

Using generalized (or weighted) least squares (GLS) estimates the genetic effect

\(\mathbf{y}=\mu+e\)

in which \(\mathbf{y}^T=[\hat{\beta}_{j,1}, \hat{\beta}_{j,2}]\), and \[ e= \left[ \begin{matrix} 1&\rho_{1,2} \\ \rho_{1,2}&1 \end{matrix} \right] \]

Correct and incorrect ways in estimating

Correct but subject to heterogeneity: \(\hat\rho=1-\lambda_{meta}\)

Incorrect and subject to genetic architecture: \(\hat\rho=cov(\mathbf{Z}_1, \mathbf{Z}_2)\) or \(\hat\rho=cov(\mathbf{\beta}_1, \mathbf{\beta}_2)\)

More details for the estimation method can be [1 & 2 are generalized lse, 3 & 4 are chisq test]

Am J Hum Genet, “Meta-analysis of genome-wide association studies with overlapping subjects.”, 2009:85, 862-72, Eq 3-5
Bioinformatics, “METAL: fast and efficient meta-analysis of genomewide association scans”, 2010:26, 2190-1, See their Table
Zhu Xiaofeng, Am J Hum Genet, 2014, 96:21-36
Mike Goddard, PLoS Genet, 2015,10:e1004198
Wang et al, Hum Mol Genet, 2013, 22:2303-2311

set.seed(1000)
CT=4
N=ceiling(rnorm(CT, 3000, 1000))
M=10000

#freq
P=runif(M, 0.05, 0.95)
Freq=matrix(0, M, CT)
for(i in 1:CT) {
  Freq[,i] = rnorm(M, P, sd=sqrt((P*(1-P))/(2*N[i])))
}

#mat
h2=0.5
Beta=rnorm(M, 0, sd=0)#=sqrt(h2/sum(2*P*(1-P))))
Bmat=matrix(0, M, CT)
Smat=matrix(0, M, CT)
Pmat=matrix(0, M, CT)
for(i in 1:CT) {
  Smat[,i]=1/(N[i]*2*Freq[,i]*(1-Freq[,i]))
  Bmat[,i] = rnorm(M, Beta, sd=sqrt(Smat[,i]))
  Pmat[,i] = (1-pnorm(abs(Bmat[,i]/sqrt(Smat[,i]))))*2
}

#fixed model
FisherP=matrix(0, M, 1)
Fmeta=matrix(0, M, 3)
Qmeta=matrix(0, M, 6)
for(i in 1:M) {
  FisherP[i,1]=-2*sum(log(Pmat[i,]))

  v=1/Smat[i,]
  Fmeta[i,2]=1/sum(v)
  Fmeta[i,1]=sum(Bmat[i,]*v)/sum(v)
  Fmeta[i,3]=Fmeta[i,1]/sqrt(Fmeta[i,2])

  Qmeta[i,4]=sum(v*(Bmat[i,]-Fmeta[i,1])^2)
  Qmeta[i,5]=sum(v)-sum(v^2)/sum(v)
  Qmeta[i,6]=(Qmeta[i,4]-(CT-1))/Qmeta[i,5]
  if(Qmeta[i,6]<0) {
    Qmeta[i,6]=0
  }
  vr=1/(Smat[i,]+Qmeta[i,6])
  Qmeta[i,2]=1/sum(v)+Qmeta[i,6]
  Qmeta[i,1]=sum(vr*Bmat[i,])/sum(vr)
  Qmeta[i,3]=Qmeta[i,1]^2/Qmeta[i,2]
}
layout(matrix(1:4, 2, 2))
qqplot(main="Fisher's methods", xlab=expression(paste("Theoretical ", chi[8]^2)),
       ylab=expression(paste("Observed ", chi[8]^2)),
       FisherP, rchisq(M, 8), bty="n", pch=16)
abline(a=0, b=1)

qqplot(main="Fixed model", xlab=expression(paste("Theoretical ", chi[1]^2)),
       ylab=expression(paste("Observed ", chi[1]^2)),
       rchisq(M, 1), Fmeta[,3]^2, bty='n', pch=16)
abline(a=0, b=1)

qqplot(main="Random model", xlab=expression(paste("Theoretical ", chi[1]^2)),
       ylab=expression(paste("Observed ", chi[1]^2)),
       rchisq(M,1), Qmeta[,1]^2/Qmeta[,2], bty='n', pch=16)
abline(a=0, b=1)

qqplot(main="Heterogeneity (4 cohorts)", xlab="Q", ylab=expression(paste("Theoretical ", chi[3]^2)), Qmeta[,4], rchisq(M,CT-1), bty='n', pch=16)
abline(a=0, b=1)

##10 MLM GWAMA Mixed model for GWAMA

\(\beta=\mu+Mb+e\)

\(\beta\sim N(\mu, \sigma_M+\sigma_e)\)

##11 Notes for LD simulation

A pair of loci

Notations	Interpretation
\(p_A, p_a=\bar p_A\)	Allele frequency for \(A\) and \(a\) at locus A
\(p_B, p_b=\bar p_B\)	Allele frequency for \(B\) and \(b\) at locus B
\(p_{AB}\)	Haplotype frequency for \(AB\)
\(D_{AB}\)	Linkage disequilibrium \(D_{AB}=p_{AB}-p_Ap_B\); the maximal \(D_{AB}=0.25\) is reached when \(p_A=p_B=0.5\)
\({D}_{AB}^\prime\)	Standardized \(D_{AB}\); if \(D_{AB} > 0\), \({D}_{AB}^\prime=\frac{{D}_{AB}}{min(p_ap_B,p_Ap_b)}\), if \(D_{AB} < 0\), \({D}_{AB}^\prime=\frac{{D}_{AB}}{min(p_ap_b,p_Ap_B)}\)

Haplotype table

		Locus \(\textbf{A}\)
		\(A\)	\(a\)
Locus \(\textbf{B}\)	\(B\)	\(AB: p_Ap_B+D\)	\(aB: p_ap_B-D\)	\(p_B\)
	\(b\)	\(Ab: p_Ap_b-D\)	\(ab: p_ap_b+D\)	\(p_b\)
		\(p_A\)	\(p_a\)

In haplotype simulation, under random mating, the conditional probability of generating the second locus could be expressed as (t is generation)

\(p(\mathbf{B}=B | \mathbf{A}=A)=\frac{p_Ap_B+D_{AB}^t}{p_A}\)
\(p(\mathbf{B}=b | \mathbf{A}=A)=\frac{p_Ap_b-D_{AB}^t}{p_A}\)
\(p(\mathbf{B}=B|\mathbf{A}=a)=\frac{p_ap_B-D_{AB}^t}{p_a}\)
\(p(\mathbf{B}=B|\mathbf{A}=a)=\frac{p_ap_b+D_{AB}^t}{p_a}\)

\(D_{AB}^t=D_{AB}(1-c_{AB})^t\), in which \(c_{AB}\) is the recombination between loci A and B.

print("Write R code for ld simulation")

## [1] "Write R code for ld simulation"

Reference

Devlin B., Risch N., 1995 A comparison of linkage disequilibrium measures for fine-scale mapping. Genomics 29: 311-22.

Lewontin R. C., 1964 The Interaction of Selection and Linkage. I. General Considerations; Heterotic Models. Genetics 49: 49-67.

##12 Super GWAMA-design The seminal idea of the design has been sketched in \(G3, 2017, 7:943-52\).

SumStat

gc5k

7/7/2018