Covariance of IBS between two loci

理论部分

研究一对sib在两个位点\(k\)和\(l\)的IBS得分乘积的期望，具体过程如下：

先看第一个位点k，其三种基因型为AA，Aa，aa，其中A等位基因的频率为p1，a等位基因的频率为q1，我们用下标p表示父亲，下标m表示母亲，我们可以获得9种交配类型，其分布律为：

根据不同的Mating type，我们可以获得这些交配模式下的Sib pair的基因型及其概率。

我们单独取出其中一种Mating type来看为\(A_pA_p\times A_mA_m\)，此时，第一个位点k的sib pair只有一种类型，即\(\{A_p A_m,A_p A_m\}\)。对于亲本第二个位点l，我们假设其有两个等位基因B和b，等位基因频率分别为p2和q2。亲本向下传递配子时，第二个位点的频率分布取决于两个位点之间的重组率\(c\)与连锁不平衡程度\(D\)。当一条单倍型不重组时，第二个位点的等位基因频率分布取决于同一条单倍型上k位点的等位基因，即\(P(第二个位点等位基因|同一条单倍型上第一个位点的等位基因)\)服从如下分布：

当重组发生时，第二个位点的等位基因频率分布取决于另一条单倍型上k位点的等位基因，即\(P(第二个位点等位基因|另一条单倍型上第一个位点的等位基因)\)，该条件概率同样服从上表。

这样我们就可以给出亲本在四条单倍型上l位点的分布，由此分布我们就可以给出父亲和目前在第l位点基因型（BB、Bb、bB、bb）的分布律（概率等于上面条件概率两两组合的乘积）。

之后，我们就可以由亲本抽样产生配子进而组合形成sib pair。由于k位点的sib pair的基因型是确定的，我们可以基于每个sib在k位点的等位基因状态来决定每个sib在l位点的分布情况。

比如，一个sib在k位点的父本allele为\(A_p\)，由于此时父亲在k位点的基因型为\(A_pA_p\)，由于基因型纯合，所以该allele继承自两条haplotype的概率为50/50（无论是否重组都是这样的状态），此时在该haplotype上l位点的allele为B的概率为\(p_B=0.5\times p(\text{父亲第一条haplotype的第二个位点为B}) + 0.5\times p(\text{父亲第二条haplotype的第二个位点为B}))\)，而为b的概率为\(1-p_B\)。

而假设父亲在k位点的基因型为杂合（\(A_pa_p\)或者\(a_pA_p\)）时，则一旦sib在k位点的allele确定，也就确定了它是来自亲本哪条haplotype，此时如果不发生重组，则sib在l位点就取决于亲本该条haplotype在l位点上的allele（B或者b），如果发生重组，则sib在l位点就取决于亲本另一条haplotype在l位点上的allele（B或者b）。

通过这种方式，我们就可以进而确定该sib在l位点的基因型（BB、Bb、bB、bb）的分布律（概率等于上面对应概率两两组合的乘积）。同样地，也可以决定另一个sib在l位点的不同基因型的分布律。

之后我们可以依次遍历两个sib在l位点的基因型组合，并计算出每种组合的概率。

确定了两个位点的基因型，我们可以根据两个Sib在第一个位点和第二个位点的同态一致性（IBS1与IBS2）得分，并计算其乘积，对于一个个体的第一个位点（\(k\)）来说，其三种基因型校正后的基因型值如下表所示：

类似地，对于一个个体的第二个位点（\(l\)）来说，其三种基因型校正后的基因型值如下表所示：

结合上面2个表，我们就可以计算出两个Sib1与Sib2之间在两个位点上IBS得分的乘积。

整体计算流程（伪代码）如下：

第一层循环：遍历 mating types

每个mating type都有对应的概率
每个 mating type 下，枚举该类型下的所有同胞k位点联合基因型对（sib_pairs），及其概率。

第二层循环：遍历重组状态（亲本四条单倍型）

四条haplotype，每个都可能发生重组和不重组，所以就有16种可能，每种的概率可以很轻易的计算。

第三层：在“sib pair × 重组状态”处，构造 l 位点联合分布

父亲两条染色体在 k 位点的allele已知（来自 mating types 的父型），但在 l 位点的具体allele（B/b）不直接已知，需要由条件概率表给出两条染色体的条件概率。
我们枚举父亲的 l 单倍型配置（四种：BB、Bb、bB、bb），并用条件概率的乘积给每种配置一个概率（两条染色体独立取样）；母亲同理。

第四层：在每个具体的“父母 l 单倍型配置 × 重组格局”下，计算两个同胞在 l 位点的联合分布

对于 sib1：结合其在k位点两个allele与父母的单倍型情况，按照前面所介绍的抽样方式获得其在l位点的基因型分布。
对于 sib2同理。
计算两个sib在 l 位点的联合分布 = s1_L_dist × s2_L_dist

第五层：累积到期望

此时的概率 = mating_freq × sib_prob × recomb_prob × 亲本l位点单倍型配置概率 × sib pair在l位点的联合分布概率
计算两个位点的IBS得分的乘积。

Numeric simulation

下面举一个例子来演示整个过程，假设两个位点完全连锁，重组率为0，此时两位点IBS得分乘积的期望理论上等于1.25。

参数初始化

两个位点，重组率0，连锁不平衡程度\(D=0.25\)，两个位点的等位基因频率为\(p1=p2=0.5\)

p1 <- 0.5 # k位点A等位基因频率
q1 <- 1 - p1 # k位点a等位基因频率
p2 <- 0.5 # l位点B等位基因频率
q2 <- 1 - p2 # l位点b等位基因频率
c <- 0
D <- 0.25 # 这样两位点就完全独立

定义辅助函数

# 校正后的基因型值计算函数（保留原逻辑）
corrected_value <- function(genotype, p) {
  if (genotype == "AA" | genotype == "BB") {
    (2 - 2*p) / sqrt(2 * p * (1 - p))
  } else if (genotype == "Aa" | genotype == "Bb" | genotype == "aA" | genotype == "bB") {
    (1 - 2*p) / sqrt(2 * p * (1 - p))
  } else if (genotype == "aa" | genotype == "bb") {
    (0 - 2*p) / sqrt(2 * p * (1 - p))
  } else {
    stop("Invalid genotype")
  }
}

# 位点l的条件概率函数
cond_L_given_K <- function(k_allele1,k_allele2,is_recombination) {
  if (is_recombination) {
    allele<-k_allele2
    if (allele == "A") {
      B_prob <- p2 + D / p1
      b_prob <- q2 - D / p1
    } else if (allele == "a") {
      B_prob <- p2 - D / q1
      b_prob <- q2 + D / q1
    } else {
      stop("Invalid allele")
    }
    return(c(B = B_prob, b = b_prob))
  } else {
    allele<-k_allele1
    if (allele == "A") {
      B_prob <- p2 + D / p1
      b_prob <- q2 - D / p1
    } else if (allele == "a") {
      B_prob <- p2 - D / q1
      b_prob <- q2 + D / q1
    } else {
      stop("Invalid allele")
    }
    return(c(B = B_prob, b = b_prob))
  }
}

# 在给定亲本两条染色体的 K/L 等位基因与是否重组时，sib从该家长在 L 位点的 B/b 概率
# - 若亲本在 K 位点纯合（AA 或 aa），孩子在K位点选到哪条染色体是 50/50；是否重组对 L 的来源等价于另一条染色体，整体仍为两条的均匀混合；
# - 若亲本在 K 位点杂合（Aa / aA），孩子在 K 位点的字母唯一对应某一条染色体；是否重组决定 L 来自同一条或另一条染色体。
child_parent_L_probs <- function(chosen_k_letter, parent_k_h1, parent_k_h2, parent_L_h1, parent_L_h2, recomb_flag) {
  if (parent_k_h1 == parent_k_h2) {
    # 纯合：孩子在K位点选到哪条染色体是50/50；重组与否都导致L来源两条均匀混合
    pB <- 0.5 * (parent_L_h1 == "B") + 0.5 * (parent_L_h2 == "B")
    return(c(B = pB, b = 1 - pB))
  }
  # 杂合：孩子在K位点的字母唯一对应某一条染色体
  source_idx <- if (chosen_k_letter == parent_k_h1) 1 else if (chosen_k_letter == parent_k_h2) 2 else stop("Chosen K not in parent haplotypes")
  L_idx <- if (recomb_flag) (3 - source_idx) else source_idx
  L_letter <- if (L_idx == 1) parent_L_h1 else parent_L_h2
  if (L_letter == "B") c(B = 1, b = 0) else c(B = 0, b = 1)
}

定义所有的交配组合及其子代的Sib pair组合的概率

mating_types <- list(
  list(parents = c("AA", "AA"), frequency = p1^4,
       sib_pairs = list(list(genotypes = c("AA", "AA"), prob = 1))),
  list(parents = c("AA", "Aa"), frequency = p1^3*q1,
       sib_pairs = list(list(genotypes = c("AA", "AA"), prob = 0.25),
                        list(genotypes = c("AA", "Aa"), prob = 0.25),
                        list(genotypes = c("Aa", "AA"), prob = 0.25),
                        list(genotypes = c("Aa", "Aa"), prob = 0.25)
       )),
  list(parents = c("AA", "aA"), frequency = p1^3*q1,
       sib_pairs = list(list(genotypes = c("AA", "AA"), prob = 0.25),
                        list(genotypes = c("AA", "Aa"), prob = 0.25),
                        list(genotypes = c("Aa", "AA"), prob = 0.25),
                        list(genotypes = c("Aa", "Aa"), prob = 0.25)
       )),
  list(parents = c("AA", "aa"), frequency = p1^2*q1^2,
       sib_pairs = list(list(genotypes = c("Aa", "Aa"), prob = 1)
       )),
  list(parents = c("Aa", "AA"), frequency = p1^3*q1,
       sib_pairs = list(list(genotypes = c("AA", "AA"), prob = 0.25),
                        list(genotypes = c("AA", "aA"), prob = 0.25),
                        list(genotypes = c("aA", "AA"), prob = 0.25),
                        list(genotypes = c("aA", "aA"), prob = 0.25)
       )),
  list(parents = c("Aa", "Aa"), frequency = p1^2*q1^2,
       sib_pairs = list(
         list(genotypes = c("AA", "AA"), prob = 0.0625),
         list(genotypes = c("AA", "Aa"), prob = 0.0625),
         list(genotypes = c("AA", "aA"), prob = 0.0625),
         list(genotypes = c("AA", "aa"), prob = 0.0625),
         
         list(genotypes = c("Aa", "AA"), prob = 0.0625),
         list(genotypes = c("Aa", "Aa"), prob = 0.0625),
         list(genotypes = c("Aa", "aA"), prob = 0.0625),
         list(genotypes = c("Aa", "aa"), prob = 0.0625),
         
         list(genotypes = c("aA", "AA"), prob = 0.0625),
         list(genotypes = c("aA", "Aa"), prob = 0.0625),
         list(genotypes = c("aA", "aA"), prob = 0.0625),
         list(genotypes = c("aA", "aa"), prob = 0.0625),
         
         list(genotypes = c("aa", "AA"), prob = 0.0625),
         list(genotypes = c("aa", "Aa"), prob = 0.0625),
         list(genotypes = c("aa", "aA"), prob = 0.0625),
         list(genotypes = c("aa", "aa"), prob = 0.0625)
       )),
  list(parents = c("Aa", "aA"), frequency = p1^2*q1^2,
       sib_pairs = list(
         list(genotypes = c("AA", "AA"), prob = 0.0625),
         list(genotypes = c("AA", "Aa"), prob = 0.0625),
         list(genotypes = c("AA", "aA"), prob = 0.0625),
         list(genotypes = c("AA", "aa"), prob = 0.0625),
         
         list(genotypes = c("Aa", "AA"), prob = 0.0625),
         list(genotypes = c("Aa", "Aa"), prob = 0.0625),
         list(genotypes = c("Aa", "aA"), prob = 0.0625),
         list(genotypes = c("Aa", "aa"), prob = 0.0625),
         
         list(genotypes = c("aA", "AA"), prob = 0.0625),
         list(genotypes = c("aA", "Aa"), prob = 0.0625),
         list(genotypes = c("aA", "aA"), prob = 0.0625),
         list(genotypes = c("aA", "aa"), prob = 0.0625),
         
         list(genotypes = c("aa", "AA"), prob = 0.0625),
         list(genotypes = c("aa", "Aa"), prob = 0.0625),
         list(genotypes = c("aa", "aA"), prob = 0.0625),
         list(genotypes = c("aa", "aa"), prob = 0.0625)
       )),
  
  list(parents = c("Aa", "aa"), frequency = p1*q1^3,
       sib_pairs = list(list(genotypes = c("Aa", "Aa"), prob = 0.25),
                        list(genotypes = c("Aa", "aa"), prob = 0.25),
                        list(genotypes = c("aa", "Aa"), prob = 0.25),
                        list(genotypes = c("aa", "aa"), prob = 0.25)
       )),
  
  list(parents = c("aA", "AA"), frequency = p1^3*q1,
       sib_pairs = list(list(genotypes = c("AA", "AA"), prob = 0.25),
                        list(genotypes = c("AA", "aA"), prob = 0.25),
                        list(genotypes = c("aA", "AA"), prob = 0.25),
                        list(genotypes = c("aA", "aA"), prob = 0.25)
       )),
  list(parents = c("aA", "Aa"), frequency = p1^2*q1^2,
       sib_pairs = list(
         list(genotypes = c("AA", "AA"), prob = 0.0625),
         list(genotypes = c("AA", "Aa"), prob = 0.0625),
         list(genotypes = c("AA", "aA"), prob = 0.0625),
         list(genotypes = c("AA", "aa"), prob = 0.0625),
         
         list(genotypes = c("Aa", "AA"), prob = 0.0625),
         list(genotypes = c("Aa", "Aa"), prob = 0.0625),
         list(genotypes = c("Aa", "aA"), prob = 0.0625),
         list(genotypes = c("Aa", "aa"), prob = 0.0625),
         
         list(genotypes = c("aA", "AA"), prob = 0.0625),
         list(genotypes = c("aA", "Aa"), prob = 0.0625),
         list(genotypes = c("aA", "aA"), prob = 0.0625),
         list(genotypes = c("aA", "aa"), prob = 0.0625),
         
         list(genotypes = c("aa", "AA"), prob = 0.0625),
         list(genotypes = c("aa", "Aa"), prob = 0.0625),
         list(genotypes = c("aa", "aA"), prob = 0.0625),
         list(genotypes = c("aa", "aa"), prob = 0.0625)
       )),
  list(parents = c("aA", "aA"), frequency = p1^2*q1^2,
       sib_pairs = list(
         list(genotypes = c("AA", "AA"), prob = 0.0625),
         list(genotypes = c("AA", "Aa"), prob = 0.0625),
         list(genotypes = c("AA", "aA"), prob = 0.0625),
         list(genotypes = c("AA", "aa"), prob = 0.0625),
         
         list(genotypes = c("Aa", "AA"), prob = 0.0625),
         list(genotypes = c("Aa", "Aa"), prob = 0.0625),
         list(genotypes = c("Aa", "aA"), prob = 0.0625),
         list(genotypes = c("Aa", "aa"), prob = 0.0625),
         
         list(genotypes = c("aA", "AA"), prob = 0.0625),
         list(genotypes = c("aA", "Aa"), prob = 0.0625),
         list(genotypes = c("aA", "aA"), prob = 0.0625),
         list(genotypes = c("aA", "aa"), prob = 0.0625),
         
         list(genotypes = c("aa", "AA"), prob = 0.0625),
         list(genotypes = c("aa", "Aa"), prob = 0.0625),
         list(genotypes = c("aa", "aA"), prob = 0.0625),
         list(genotypes = c("aa", "aa"), prob = 0.0625)
       )),
  
  list(parents = c("aA", "aa"), frequency = p1*q1^3,
       sib_pairs = list(list(genotypes = c("Aa", "Aa"), prob = 0.25),
                        list(genotypes = c("Aa", "aa"), prob = 0.25),
                        list(genotypes = c("aa", "Aa"), prob = 0.25),
                        list(genotypes = c("aa", "aa"), prob = 0.25)
       )),
  
  list(parents = c("aa", "AA"), frequency = p1^2*q1^2,
       sib_pairs = list(list(genotypes = c("aA", "aA"), prob = 1))),
  
  list(parents = c("aa", "Aa"), frequency = p1*q1^3,
       sib_pairs = list(list(genotypes = c("aA", "aA"), prob = 0.25),
                        list(genotypes = c("aA", "aa"), prob = 0.25),
                        list(genotypes = c("aa", "aA"), prob = 0.25),
                        list(genotypes = c("aa", "aa"), prob = 0.25)
       )),
  
  list(parents = c("aa", "aA"), frequency = p1*q1^3,
       sib_pairs = list(list(genotypes = c("aA", "aA"), prob = 0.25),
                        list(genotypes = c("aA", "aa"), prob = 0.25),
                        list(genotypes = c("aa", "aA"), prob = 0.25),
                        list(genotypes = c("aa", "aa"), prob = 0.25)
       )),
  
  list(parents = c("aa", "aa"), frequency = q1^4,
       sib_pairs = list(list(genotypes = c("aa", "aa"), prob = 1)))
)

遍历所有情况

先定义结果存储的变量：

# 结果保存表格
result_df <- data.frame(
  MatingType = character(),
  SibTypeK = character(),
  SibTypeL = character(),
  RecombinationType = character(),
  TotalProbability = numeric(),
  OmegaK = numeric(),
  OmegaL = numeric(),
  OmegaProduct = numeric(),
  TotalExpection = numeric(),
  stringsAsFactors = FALSE
)


total_expectation <- 0

再遍历所有情况

for (mating in mating_types) {
  mating_freq <- mating$frequency
  mating_type_str <- paste(mating$parents, collapse = "x")
  father_geno<-mating$parents[1]
  mother_geno<-mating$parents[2]
  
  father_allele1_k<-substr(father_geno,1,1)
  father_allele2_k<-substr(father_geno,2,2)
  mother_allele1_k<-substr(mother_geno,1,1)
  mother_allele2_k<-substr(mother_geno,2,2)
  
  # 遍历所有重组组合（f1, f2, m1, m2）
  recomb <- expand.grid(f1 = c(TRUE, FALSE), m1 = c(TRUE, FALSE),
                        f2 = c(TRUE, FALSE), m2 = c(TRUE, FALSE))
  recomb$prob<-NA
  
  for (i in 1:nrow(recomb)) {
    rc <- recomb[i, 1:4]
    recomb_prob <- (c^sum(unlist(rc))) * ((1 - c)^(4 - sum(unlist(rc))))
    recomb$prob[i] <- recomb_prob
  }
  
  # 每个同胞对在每种重组情形下，先枚举父母L单倍型配置，再构建该同胞对的L联合分布并累计期望
  for (sib_pair in mating$sib_pairs) {
    sib_type_str1 <- paste(sib_pair$genotypes, collapse = "_")
    sib1_k <- sib_pair$genotypes[1]
    sib2_k <- sib_pair$genotypes[2]
    sib_prob <- sib_pair$prob
    
    # K 位点 IBS 乘积
    sib1_k_score <- corrected_value(sib1_k, p1)
    sib2_k_score <- corrected_value(sib2_k, p1)
    k_product <- sib1_k_score * sib2_k_score
    
    # K 位点父/母来源字母
    s1_f_k <- substr(sib1_k, 1, 1); s1_m_k <- substr(sib1_k, 2, 2)
    s2_f_k <- substr(sib2_k, 1, 1); s2_m_k <- substr(sib2_k, 2, 2)
    
    for (i in 1:nrow(recomb)) {
      rc <- recomb[i, 1:4]
      recomb_prob <- recomb$prob[i]
      recomb_type <- sprintf(
        "Father:%s%s|Mother:%s%s",
        ifelse(rc$f1, "R", "N"), 
        ifelse(rc$f2, "R", "N"), 
        ifelse(rc$m1, "R", "N"), 
        ifelse(rc$m2, "R", "N")
      )
      
      # 父母各自两条染色体在 L 位点的B/b条件概率（仅依赖各自染色体在K位点的字母）
      father_h1_Lp <- cond_L_given_K(father_allele1_k, father_allele2_k, rc$f1)
      father_h2_Lp <- cond_L_given_K(father_allele2_k, father_allele1_k, rc$f2)
      mother_h1_Lp <- cond_L_given_K(mother_allele1_k, mother_allele2_k, rc$m1)
      mother_h2_Lp <- cond_L_given_K(mother_allele2_k, mother_allele1_k, rc$m2)
      
      # 枚举父亲/母亲的 L 单倍型配置（这一步将父母的L字母具体化，供两个同胞共享）
      father_haplo_configs <- list(
        list(l_letters = c("B","B"), prob = father_h1_Lp["B"] * father_h2_Lp["B"]),
        list(l_letters = c("B","b"), prob = father_h1_Lp["B"] * father_h2_Lp["b"]),
        list(l_letters = c("b","B"), prob = father_h1_Lp["b"] * father_h2_Lp["B"]),
        list(l_letters = c("b","b"), prob = father_h1_Lp["b"] * father_h2_Lp["b"])
      )
      mother_haplo_configs <- list(
        list(l_letters = c("B","B"), prob = mother_h1_Lp["B"] * mother_h2_Lp["B"]),
        list(l_letters = c("B","b"), prob = mother_h1_Lp["B"] * mother_h2_Lp["b"]),
        list(l_letters = c("b","B"), prob = mother_h1_Lp["b"] * mother_h2_Lp["B"]),
        list(l_letters = c("b","b"), prob = mother_h1_Lp["b"] * mother_h2_Lp["b"])
      )
      
      # 为该同胞对 + 该重组格局构建 L 位点联合分布
      l_types <- list(
        list(genotypes = c("BB", "BB"),prob=0),
        list(genotypes = c("BB", "Bb"),prob=0),
        list(genotypes = c("BB", "bB"),prob=0),
        list(genotypes = c("BB", "bb"),prob=0),
        list(genotypes = c("Bb", "BB"),prob=0),
        list(genotypes = c("Bb", "Bb"),prob=0),
        list(genotypes = c("Bb", "bB"),prob=0),
        list(genotypes = c("Bb", "bb"),prob=0),
        list(genotypes = c("bB", "BB"),prob=0),
        list(genotypes = c("bB", "Bb"),prob=0),
        list(genotypes = c("bB", "bB"),prob=0),
        list(genotypes = c("bB", "bb"),prob=0),
        list(genotypes = c("bb", "BB"),prob=0),
        list(genotypes = c("bb", "Bb"),prob=0),
        list(genotypes = c("bb", "bB"),prob=0),
        list(genotypes = c("bb", "bb"),prob=0)
      )
      
      # 通过父母 L 单倍型配置推导两个同胞的 L 联合分布
      for (father_haplo in father_haplo_configs) {
        for (mother_haplo in mother_haplo_configs) {
          config_prob <- father_haplo$prob * mother_haplo$prob
          
          # sib1：从父母各自获得的 L 位点 B/b 概率（取决于 K 来源与是否重组，以及父母两条染色体在 L 的具体状态）
          s1_f_L <- child_parent_L_probs(s1_f_k, father_allele1_k, father_allele2_k, father_haplo$l_letters[1], father_haplo$l_letters[2], rc$f1)
          s1_m_L <- child_parent_L_probs(s1_m_k, mother_allele1_k, mother_allele2_k, mother_haplo$l_letters[1], mother_haplo$l_letters[2], rc$m1)
          s1_L_dist <- list(
            BB = s1_f_L["B"] * s1_m_L["B"],
            Bb = s1_f_L["B"] * s1_m_L["b"],
            bB = s1_f_L["b"] * s1_m_L["B"],
            bb = s1_f_L["b"] * s1_m_L["b"]
          )
          
          # sib2 同理
          s2_f_L <- child_parent_L_probs(s2_f_k, father_allele1_k, father_allele2_k, father_haplo$l_letters[1], father_haplo$l_letters[2], rc$f2)
          s2_m_L <- child_parent_L_probs(s2_m_k, mother_allele1_k, mother_allele2_k, mother_haplo$l_letters[1], mother_haplo$l_letters[2], rc$m2)
          s2_L_dist <- list(
            BB = s2_f_L["B"] * s2_m_L["B"],
            Bb = s2_f_L["B"] * s2_m_L["b"],
            bB = s2_f_L["b"] * s2_m_L["B"],
            bb = s2_f_L["b"] * s2_m_L["b"]
          )
          
          # 合并成同胞对在 L 位点的联合分布，并累加到 l_types
          for (s1_l in names(s1_L_dist)) {
            for (s2_l in names(s2_L_dist)) {
              joint_prob_l <- as.numeric(s1_L_dist[[s1_l]]) * as.numeric(s2_L_dist[[s2_l]]) * config_prob
              for (idx in 1:length(l_types)) {
                if (l_types[[idx]]$genotypes[1] == s1_l && l_types[[idx]]$genotypes[2] == s2_l) {
                  l_types[[idx]]$prob <- l_types[[idx]]$prob + joint_prob_l
                }
              }
            }
          }
        }
      }
      
      # 用该同胞对的 l_types 来累加期望，并记录到结果表
      for (l_type in l_types) {
        sib_type_str2 <- paste(l_type$genotypes, collapse = "_")
        joint_prob_l <- l_type$prob
        
        # 计算 l 位点 IBS 得分乘积
        s1_l <- l_type$genotypes[1]
        s2_l <- l_type$genotypes[2]
        s1_l_score <- corrected_value(s1_l, p2)
        s2_l_score <- corrected_value(s2_l, p2)
        l_product <- s1_l_score * s2_l_score
        
        # 计算总概率和总乘积
        total_prob <- mating_freq * sib_prob * recomb_prob * joint_prob_l
        total_product <- k_product * l_product
        
        # 累加到总期望，并记录到结果表
        total_expectation <- total_expectation + total_product * total_prob
        
        new_row <- data.frame(
          MatingType = mating_type_str,
          SibTypeK = sib_type_str1,
          SibTypeL = sib_type_str2,
          RecombinationType = recomb_type,
          TotalProbability = total_prob,
          OmegaK = k_product,
          OmegaL = l_product,
          OmegaProduct = total_product,
          TotalExpection = total_product * total_prob
        )
        result_df <- rbind(result_df, new_row)
      }
    }
  }
}

展示结果:

cat("期望值 E(Ωk Ωl) =", total_expectation, "\n")

## 期望值 E(Ωk Ωl) = 1.25

抽样模拟验证

我们使用模拟出不同的单倍型然后进行抽样的方式来检验我们的numerical simulation是否符合预期。

参数初始化

保持与numerical模拟相同即可

set.seed(123)
p1 <- 0.5 # k位点A等位基因频率
q1 <- 1 - p1 # k位点a等位基因频率
p2 <- 0.5 # l位点B等位基因频率
q2 <- 1 - p2 # l位点b等位基因频率
c <- 0
D <- 0.25

之后，我们可以基于等位基因频率计算出四种单倍型的频率，为后续产生配子库做准备

# 单体型频率计算
haplo_freq <- c(p1*p2 + D, p1*q2 - D, q1*p2 - D, q1*q2 + D)
names(haplo_freq) <- c("AB", "Ab", "aB", "ab")

# 单体型定义数据框
haplo_df <- data.frame(
  haplo = names(haplo_freq),
  k = substr(names(haplo_freq), 1, 1),
  l = substr(names(haplo_freq), 2, 2),
  prob = haplo_freq
)

定义辅助函数

# 校正基因型值函数
corrected_value <- function(genotype, p) {
  if (genotype %in% c("AA", "BB")) {
    (2 - 2*p) / sqrt(2 * p * (1 - p))
  } else if (genotype %in% c("Aa", "Bb")) {
    (1 - 2*p) / sqrt(2 * p * (1 - p))
  } else if (genotype %in% c("aa", "bb")) {
    (0 - 2*p) / sqrt(2 * p * (1 - p))
  } else {
    stop("无效基因型")
  }
}

# 配子生成函数：从亲本的单倍型中按照重组率抽取单倍型
generate_gamete <- function(parent_haplos, recomb) {
  is_recomb <- runif(1) < recomb # 判断是否重组
  if (is_recomb) { # 如果重组
    if(runif(1)<0.5){
      gamete <- c(
        k = parent_haplos$k[1],
        l = parent_haplos$l[2],
        recomb = TRUE
      )
    }else{
      gamete <- c(
        k = parent_haplos$k[2],
        l = parent_haplos$l[1],
        recomb = TRUE
      )
    }
  } else {
    idx <- sample(1:2, 1)
    gamete <- c(
      k = parent_haplos$k[idx],
      l = parent_haplos$l[idx],
      recomb = FALSE
    )
  }
  return(as.list(gamete))
}

# 基因型格式化
format_geno <- function(alleles) {
  sorted <- rev(sort(alleles))
  paste0(sorted, collapse = "")
}

对各个家庭的模拟

我们模拟1000个家庭，先定义一个模拟家庭的函数：

simulate_family <- function(father, mother, recomb) {
  # 生成Sib1配子
  f_gamete1 <- generate_gamete(father, recomb)
  m_gamete1 <- generate_gamete(mother, recomb)
  
  # 生产Sib2配子
  f_gamete2 <- generate_gamete(father, recomb)
  m_gamete2 <- generate_gamete(mother, recomb)
  
  # 生成子代基因型
  sib1_k <- format_geno(c(f_gamete1$k, m_gamete1$k))
  sib1_l <- format_geno(c(f_gamete1$l, m_gamete1$l))
  sib2_k <- format_geno(c(f_gamete2$k, m_gamete2$k))
  sib2_l <- format_geno(c(f_gamete2$l, m_gamete2$l))
  
  # 确定重组类型
  recomb_type <- paste0(
    "S1:",
    ifelse(f_gamete1$recomb,"R","N"),
    ifelse(m_gamete1$recomb,"R","N"),
    "|",
    "S2:",
    ifelse(f_gamete2$recomb,"R","N"),
    ifelse(m_gamete2$recomb,"R","N")
  )
  
  # 计算Omega值
  omega_k <- corrected_value(sib1_k, p1) * corrected_value(sib2_k, p1)
  omega_l <- corrected_value(sib1_l, p2) * corrected_value(sib2_l, p2)
  
  # Mating type
  mating_type <- paste(format_geno(c(father$k[1], father$k[2])),
                       format_geno(c(mother$k[1], mother$k[2])),
                       sep="x"
  )
  
  return(data.frame(
    MatingType = mating_type,
    SibTypeK = paste(c(sib1_k, sib2_k), collapse = "_"),
    SibTypeL = paste(c(sib1_l, sib2_l), collapse = "_"),
    RecombinationType = recomb_type,
    OmegaK = omega_k,
    OmegaL = omega_l,
    OmegaProduct = omega_k * omega_l
  ))
}

然后再模拟每个家庭，将所有结果整合起来

# 批量模拟1000家庭
n_families<-1000
results <- vector("list", n_families)
for(f in 1:n_families){
  # 生成父母单体型
  father <- haplo_df[sample(1:4, 2, replace = TRUE, prob = haplo_freq), ]
  mother <- haplo_df[sample(1:4, 2, replace = TRUE, prob = haplo_freq), ]
  
  # 模拟家庭
  res <- simulate_family(father, mother, c)
  results[[f]] <- res
}

results<-rbindlist(results)
results_summary <- results %>%
  group_by(MatingType, SibTypeK, SibTypeL, RecombinationType,
           OmegaK, OmegaL, OmegaProduct) %>%
  summarise(
    TotalProbability = dplyr::n() / n_families,
    .groups = 'drop'
  ) %>%
  mutate(
    TotalExpection = OmegaProduct * TotalProbability
  )

cat("期望值 E(Ωk Ωl) =", sum(results_summary$TotalExpection), "\n")

## 期望值 E(Ωk Ωl) = 1.124

可以看出这个数值与理论值还是比较接近的

十位点模拟

为了进一步验证我们模拟的准确性，我们做了10个位点两两之间的理论模拟与实际模拟，共10000个家庭，看下是否偏差依然存在

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

可以看出拟合得非常完美。

Covariance of IBS between two loci

Zhe Zhang

2025-09-30

理论部分

Numeric simulation

参数初始化

定义辅助函数

定义所有的交配组合及其子代的Sib pair组合的概率

遍历所有情况

抽样模拟验证

参数初始化

定义辅助函数

对各个家庭的模拟

十位点模拟