Initially, there are 784,256 raw SNPs in total. The raw SNP data was filtered by MAF (0.05), HWE (1e-6), missing rate (0.05). Then the sparse missing genotypes were imputed using Beagle.
Then, within each sub-group (All, Brothers, Sisters, Bro-Sis pairs, Bro pairs + Sis pairs, Subset sisters), we further filtered SNPs by MAF (0.05) and HWE (1e-6) and the number of SNPs are as follows:
The genome length measured in Morgan can be put in the following:
\[ m_{e.\text{IBS}}=\text{var}(\text{GRM}_{\text{sib}})=\frac{1}{16L} \]
We can depict the distribution of IBS across differnt groups of families and calculate the \(\frac{1}{\text{var}(\text{GRM}_{\text{sib}})}\) using clean SNPs
Compared with Visscher et al. 2006, we can see that there is a good-fitting line between \(\text{var}(\text{IBS})\) and the length calculated in their paper:
The variance IBS score (\(g_i\)) for \(i^{th}\) locus between a pair sib is \(\text{var}(g_i)=\theta^2r_{ii}^2+r_{ii}^2\). We calculated \(\text{var}(g_i)\) across all loci with minor allel frequency greater than 0.05 and number of genotyped families greater than 500, and the density of \(\text{var}(g_i)\) can be shown below:
The pattern of \(\text{var}(g_i)\) across differnt MAF can be observed below (The red line is theoritical line):
According to SimulationLD.pdf, I simulated a toy example
with two loci:
set.seed(12345)
# Allele frequency of two loci
f_A<-0.6
f_a<-1-f_A
f_B<-0.7
f_b<-1-f_B
# D and generation
D<-0.1
t<-1
# Sample size 100
size<-100
# First locus sampled based on allele frequency
gen1<-sample(0:1,size*2,replace = T,prob=c(f_a,f_A))
# Second locus
gen2<-numeric(size*2)
for(i in 1:length(gen2)){
if(gen1[i]==1){
gen2[i]<-sample(0:1,1,prob=c((f_A*f_b-D^t)/f_A,(f_A*f_B+D^t)/f_A))
}else{
gen2[i]<-sample(0:1,1,prob=c((f_a*f_b+D^t)/f_a,(f_a*f_B-D^t)/f_a))
}
}
# Re-estimate allele frequencies and D from samples
p_A<-sum(gen1)/(2*size)
p_B<-sum(gen2)/(2*size)
D_hat<-sum(gen1==1 & gen2==1)/(2*size) - p_A*p_B
cat(" Real allele frequency of A is ", f_A, "; Estimated allele frequency of A is ", p_A, "\n",
"Real allele frequency of B is ", f_B, "; Estimated allele frequency of A is ", p_B, "\n",
"Real D is ", D, "; Estimated D is ", D_hat, "\n"
)
## Real allele frequency of A is 0.6 ; Estimated allele frequency of A is 0.51
## Real allele frequency of B is 0.7 ; Estimated allele frequency of A is 0.745
## Real D is 0.1 ; Estimated D is 0.09005
In contrast, for IBD for a pair of sib is \((1-2c_{ij})^2\), in which \(c_{ij}\) is the recombination faction between a pair of loci \(i\) and \(j\).
Two scenarios:
SingleSib: randomly selected one individual from each family and calculate \(h^2\) using HE-reg method implemented in GCTA across 81 quantitative traits and 251 metabolism traits
Sib: Calculate \(h^2\) (cross-product or squared difference) using HE-reg based on family IBS across 81 quantitative traits and 251 metabolism traits
Square of difference:
Corss-product:
Square of difference:
Corss-product:
\(y_{\text{sib}}=a+bx+e\), here \(y_{\text{sib}}\) means the phenotype of the other sib.
We can now check the results between Single-sib and Alternative sib:
I have completed the comparison between IBS GWAS based on square difference and single-sib GWAS:
Height
BMI
AgeDiabetes