理论部分

研究一对sib在两个位点\(k\)\(l\)的IBS得分乘积的期望,具体过程如下:

先看第一个位点k,其三种基因型为AA,Aa,aa,其中A等位基因的频率为p1,a等位基因的频率为q1,我们用下标p表示父亲,下标m表示母亲,我们可以获得9种交配类型,其分布律为:

根据不同的Mating type,我们可以获得这些交配模式下的Sib pair的基因型及其概率。

我们单独取出其中一种Mating type来看为\(A_pA_p\times A_mA_m\),此时,第一个位点k的sib pair只有一种类型,即\(\{A_p A_m,A_p A_m\}\)。对于亲本第二个位点l,我们假设其有两个等位基因B和b,等位基因频率分别为p2和q2。亲本向下传递配子时,第二个位点的频率分布取决于两个位点之间的重组率\(c\)与连锁不平衡程度\(D\)。当一条单倍型不重组时,第二个位点的等位基因频率分布取决于同一条单倍型上k位点的等位基因,即\(P(第二个位点等位基因|同一条单倍型上第一个位点的等位基因)\)服从如下分布:

当重组发生时,第二个位点的等位基因频率分布取决于另一条单倍型上k位点的等位基因,即\(P(第二个位点等位基因|另一条单倍型上第一个位点的等位基因)\),该条件概率同样服从上表。

这样我们就可以给出亲本在四条单倍型上l位点的分布,由此分布我们就可以给出父亲和目前在第l位点基因型(BB、Bb、bB、bb)的分布律(概率等于上面条件概率两两组合的乘积)。

之后,我们就可以由亲本抽样产生配子进而组合形成sib pair。由于k位点的sib pair的基因型是确定的,我们可以基于每个sib在k位点的等位基因状态来决定每个sib在l位点的分布情况。

比如,一个sib在k位点的父本allele为\(A_p\),由于此时父亲在k位点的基因型为\(A_pA_p\),由于基因型纯合,所以该allele继承自两条haplotype的概率为50/50(无论是否重组都是这样的状态),此时在该haplotype上l位点的allele为B的概率为\(p_B=0.5\times p(\text{父亲第一条haplotype的第二个位点为B}) + 0.5\times p(\text{父亲第二条haplotype的第二个位点为B}))\),而为b的概率为\(1-p_B\)

而假设父亲在k位点的基因型为杂合(\(A_pa_p\)或者\(a_pA_p\))时,则一旦sib在k位点的allele确定,也就确定了它是来自亲本哪条haplotype,此时如果不发生重组,则sib在l位点就取决于亲本该条haplotype在l位点上的allele(B或者b),如果发生重组,则sib在l位点就取决于亲本另一条haplotype在l位点上的allele(B或者b)。

通过这种方式,我们就可以进而确定该sib在l位点的基因型(BB、Bb、bB、bb)的分布律(概率等于上面对应概率两两组合的乘积)。同样地,也可以决定另一个sib在l位点的不同基因型的分布律。

之后我们可以依次遍历两个sib在l位点的基因型组合,并计算出每种组合的概率。

确定了两个位点的基因型,我们可以根据两个Sib在第一个位点和第二个位点的同态一致性(IBS1与IBS2)得分,并计算其乘积,对于一个个体的第一个位点(\(k\))来说,其三种基因型校正后的基因型值如下表所示:

类似地,对于一个个体的第二个位点(\(l\))来说,其三种基因型校正后的基因型值如下表所示:

结合上面2个表,我们就可以计算出两个Sib1与Sib2之间在两个位点上IBS得分的乘积。

整体计算流程(伪代码)如下:

  1. 第一层循环:遍历 mating types
  • 每个mating type都有对应的概率
  • 每个 mating type 下,枚举该类型下的所有同胞k位点联合基因型对(sib_pairs),及其概率 。
  1. 第二层循环:遍历重组状态(亲本四条单倍型)
  • 四条haplotype,每个都可能发生重组和不重组,所以就有16种可能,每种的概率可以很轻易的计算。
  1. 第三层:在“sib pair × 重组状态”处,构造 l 位点联合分布
  • 父亲两条染色体在 k 位点的allele已知(来自 mating types 的父型),但在 l 位点的具体allele(B/b)不直接已知,需要由条件概率表给出两条染色体的条件概率。
  • 我们枚举父亲的 l 单倍型配置(四种:BB、Bb、bB、bb),并用条件概率的乘积给每种配置一个概率(两条染色体独立取样);母亲同理。
  1. 第四层:在每个具体的“父母 l 单倍型配置 × 重组格局”下,计算两个同胞在 l 位点的联合分布
  • 对于 sib1:结合其在k位点两个allele与父母的单倍型情况,按照前面所介绍的抽样方式获得其在l位点的基因型分布。
  • 对于 sib2同理。
  • 计算两个sib在 l 位点的联合分布 = s1_L_dist × s2_L_dist
  1. 第五层:累积到期望
  • 此时的概率 = mating_freq × sib_prob × recomb_prob × 亲本l位点单倍型配置概率 × sib pair在l位点的联合分布概率
  • 计算两个位点的IBS得分的乘积。

Numeric simulation

下面举一个例子来演示整个过程,假设两个位点完全连锁,重组率为0,此时两位点IBS得分乘积的期望理论上等于1.25。

参数初始化

两个位点,重组率0,连锁不平衡程度\(D=0.25\),两个位点的等位基因频率为\(p1=p2=0.5\)

p1 <- 0.5 # k位点A等位基因频率
q1 <- 1 - p1 # k位点a等位基因频率
p2 <- 0.5 # l位点B等位基因频率
q2 <- 1 - p2 # l位点b等位基因频率
c <- 0
D <- 0.25 # 这样两位点就完全独立

定义辅助函数

# 校正后的基因型值计算函数(保留原逻辑)
corrected_value <- function(genotype, p) {
  if (genotype == "AA" | genotype == "BB") {
    (2 - 2*p) / sqrt(2 * p * (1 - p))
  } else if (genotype == "Aa" | genotype == "Bb" | genotype == "aA" | genotype == "bB") {
    (1 - 2*p) / sqrt(2 * p * (1 - p))
  } else if (genotype == "aa" | genotype == "bb") {
    (0 - 2*p) / sqrt(2 * p * (1 - p))
  } else {
    stop("Invalid genotype")
  }
}

# 位点l的条件概率函数
cond_L_given_K <- function(k_allele1,k_allele2,is_recombination) {
  if (is_recombination) {
    allele<-k_allele2
    if (allele == "A") {
      B_prob <- p2 + D / p1
      b_prob <- q2 - D / p1
    } else if (allele == "a") {
      B_prob <- p2 - D / q1
      b_prob <- q2 + D / q1
    } else {
      stop("Invalid allele")
    }
    return(c(B = B_prob, b = b_prob))
  } else {
    allele<-k_allele1
    if (allele == "A") {
      B_prob <- p2 + D / p1
      b_prob <- q2 - D / p1
    } else if (allele == "a") {
      B_prob <- p2 - D / q1
      b_prob <- q2 + D / q1
    } else {
      stop("Invalid allele")
    }
    return(c(B = B_prob, b = b_prob))
  }
}

# 在给定亲本两条染色体的 K/L 等位基因与是否重组时,sib从该家长在 L 位点的 B/b 概率
# - 若亲本在 K 位点纯合(AA 或 aa),孩子在K位点选到哪条染色体是 50/50;是否重组对 L 的来源等价于另一条染色体,整体仍为两条的均匀混合;
# - 若亲本在 K 位点杂合(Aa / aA),孩子在 K 位点的字母唯一对应某一条染色体;是否重组决定 L 来自同一条或另一条染色体。
child_parent_L_probs <- function(chosen_k_letter, parent_k_h1, parent_k_h2, parent_L_h1, parent_L_h2, recomb_flag) {
  if (parent_k_h1 == parent_k_h2) {
    # 纯合:孩子在K位点选到哪条染色体是50/50;重组与否都导致L来源两条均匀混合
    pB <- 0.5 * (parent_L_h1 == "B") + 0.5 * (parent_L_h2 == "B")
    return(c(B = pB, b = 1 - pB))
  }
  # 杂合:孩子在K位点的字母唯一对应某一条染色体
  source_idx <- if (chosen_k_letter == parent_k_h1) 1 else if (chosen_k_letter == parent_k_h2) 2 else stop("Chosen K not in parent haplotypes")
  L_idx <- if (recomb_flag) (3 - source_idx) else source_idx
  L_letter <- if (L_idx == 1) parent_L_h1 else parent_L_h2
  if (L_letter == "B") c(B = 1, b = 0) else c(B = 0, b = 1)
}

定义所有的交配组合及其子代的Sib pair组合的概率

mating_types <- list(
  list(parents = c("AA", "AA"), frequency = p1^4,
       sib_pairs = list(list(genotypes = c("AA", "AA"), prob = 1))),
  list(parents = c("AA", "Aa"), frequency = p1^3*q1,
       sib_pairs = list(list(genotypes = c("AA", "AA"), prob = 0.25),
                        list(genotypes = c("AA", "Aa"), prob = 0.25),
                        list(genotypes = c("Aa", "AA"), prob = 0.25),
                        list(genotypes = c("Aa", "Aa"), prob = 0.25)
       )),
  list(parents = c("AA", "aA"), frequency = p1^3*q1,
       sib_pairs = list(list(genotypes = c("AA", "AA"), prob = 0.25),
                        list(genotypes = c("AA", "Aa"), prob = 0.25),
                        list(genotypes = c("Aa", "AA"), prob = 0.25),
                        list(genotypes = c("Aa", "Aa"), prob = 0.25)
       )),
  list(parents = c("AA", "aa"), frequency = p1^2*q1^2,
       sib_pairs = list(list(genotypes = c("Aa", "Aa"), prob = 1)
       )),
  list(parents = c("Aa", "AA"), frequency = p1^3*q1,
       sib_pairs = list(list(genotypes = c("AA", "AA"), prob = 0.25),
                        list(genotypes = c("AA", "aA"), prob = 0.25),
                        list(genotypes = c("aA", "AA"), prob = 0.25),
                        list(genotypes = c("aA", "aA"), prob = 0.25)
       )),
  list(parents = c("Aa", "Aa"), frequency = p1^2*q1^2,
       sib_pairs = list(
         list(genotypes = c("AA", "AA"), prob = 0.0625),
         list(genotypes = c("AA", "Aa"), prob = 0.0625),
         list(genotypes = c("AA", "aA"), prob = 0.0625),
         list(genotypes = c("AA", "aa"), prob = 0.0625),
         
         list(genotypes = c("Aa", "AA"), prob = 0.0625),
         list(genotypes = c("Aa", "Aa"), prob = 0.0625),
         list(genotypes = c("Aa", "aA"), prob = 0.0625),
         list(genotypes = c("Aa", "aa"), prob = 0.0625),
         
         list(genotypes = c("aA", "AA"), prob = 0.0625),
         list(genotypes = c("aA", "Aa"), prob = 0.0625),
         list(genotypes = c("aA", "aA"), prob = 0.0625),
         list(genotypes = c("aA", "aa"), prob = 0.0625),
         
         list(genotypes = c("aa", "AA"), prob = 0.0625),
         list(genotypes = c("aa", "Aa"), prob = 0.0625),
         list(genotypes = c("aa", "aA"), prob = 0.0625),
         list(genotypes = c("aa", "aa"), prob = 0.0625)
       )),
  list(parents = c("Aa", "aA"), frequency = p1^2*q1^2,
       sib_pairs = list(
         list(genotypes = c("AA", "AA"), prob = 0.0625),
         list(genotypes = c("AA", "Aa"), prob = 0.0625),
         list(genotypes = c("AA", "aA"), prob = 0.0625),
         list(genotypes = c("AA", "aa"), prob = 0.0625),
         
         list(genotypes = c("Aa", "AA"), prob = 0.0625),
         list(genotypes = c("Aa", "Aa"), prob = 0.0625),
         list(genotypes = c("Aa", "aA"), prob = 0.0625),
         list(genotypes = c("Aa", "aa"), prob = 0.0625),
         
         list(genotypes = c("aA", "AA"), prob = 0.0625),
         list(genotypes = c("aA", "Aa"), prob = 0.0625),
         list(genotypes = c("aA", "aA"), prob = 0.0625),
         list(genotypes = c("aA", "aa"), prob = 0.0625),
         
         list(genotypes = c("aa", "AA"), prob = 0.0625),
         list(genotypes = c("aa", "Aa"), prob = 0.0625),
         list(genotypes = c("aa", "aA"), prob = 0.0625),
         list(genotypes = c("aa", "aa"), prob = 0.0625)
       )),
  
  list(parents = c("Aa", "aa"), frequency = p1*q1^3,
       sib_pairs = list(list(genotypes = c("Aa", "Aa"), prob = 0.25),
                        list(genotypes = c("Aa", "aa"), prob = 0.25),
                        list(genotypes = c("aa", "Aa"), prob = 0.25),
                        list(genotypes = c("aa", "aa"), prob = 0.25)
       )),
  
  list(parents = c("aA", "AA"), frequency = p1^3*q1,
       sib_pairs = list(list(genotypes = c("AA", "AA"), prob = 0.25),
                        list(genotypes = c("AA", "aA"), prob = 0.25),
                        list(genotypes = c("aA", "AA"), prob = 0.25),
                        list(genotypes = c("aA", "aA"), prob = 0.25)
       )),
  list(parents = c("aA", "Aa"), frequency = p1^2*q1^2,
       sib_pairs = list(
         list(genotypes = c("AA", "AA"), prob = 0.0625),
         list(genotypes = c("AA", "Aa"), prob = 0.0625),
         list(genotypes = c("AA", "aA"), prob = 0.0625),
         list(genotypes = c("AA", "aa"), prob = 0.0625),
         
         list(genotypes = c("Aa", "AA"), prob = 0.0625),
         list(genotypes = c("Aa", "Aa"), prob = 0.0625),
         list(genotypes = c("Aa", "aA"), prob = 0.0625),
         list(genotypes = c("Aa", "aa"), prob = 0.0625),
         
         list(genotypes = c("aA", "AA"), prob = 0.0625),
         list(genotypes = c("aA", "Aa"), prob = 0.0625),
         list(genotypes = c("aA", "aA"), prob = 0.0625),
         list(genotypes = c("aA", "aa"), prob = 0.0625),
         
         list(genotypes = c("aa", "AA"), prob = 0.0625),
         list(genotypes = c("aa", "Aa"), prob = 0.0625),
         list(genotypes = c("aa", "aA"), prob = 0.0625),
         list(genotypes = c("aa", "aa"), prob = 0.0625)
       )),
  list(parents = c("aA", "aA"), frequency = p1^2*q1^2,
       sib_pairs = list(
         list(genotypes = c("AA", "AA"), prob = 0.0625),
         list(genotypes = c("AA", "Aa"), prob = 0.0625),
         list(genotypes = c("AA", "aA"), prob = 0.0625),
         list(genotypes = c("AA", "aa"), prob = 0.0625),
         
         list(genotypes = c("Aa", "AA"), prob = 0.0625),
         list(genotypes = c("Aa", "Aa"), prob = 0.0625),
         list(genotypes = c("Aa", "aA"), prob = 0.0625),
         list(genotypes = c("Aa", "aa"), prob = 0.0625),
         
         list(genotypes = c("aA", "AA"), prob = 0.0625),
         list(genotypes = c("aA", "Aa"), prob = 0.0625),
         list(genotypes = c("aA", "aA"), prob = 0.0625),
         list(genotypes = c("aA", "aa"), prob = 0.0625),
         
         list(genotypes = c("aa", "AA"), prob = 0.0625),
         list(genotypes = c("aa", "Aa"), prob = 0.0625),
         list(genotypes = c("aa", "aA"), prob = 0.0625),
         list(genotypes = c("aa", "aa"), prob = 0.0625)
       )),
  
  list(parents = c("aA", "aa"), frequency = p1*q1^3,
       sib_pairs = list(list(genotypes = c("Aa", "Aa"), prob = 0.25),
                        list(genotypes = c("Aa", "aa"), prob = 0.25),
                        list(genotypes = c("aa", "Aa"), prob = 0.25),
                        list(genotypes = c("aa", "aa"), prob = 0.25)
       )),
  
  list(parents = c("aa", "AA"), frequency = p1^2*q1^2,
       sib_pairs = list(list(genotypes = c("aA", "aA"), prob = 1))),
  
  list(parents = c("aa", "Aa"), frequency = p1*q1^3,
       sib_pairs = list(list(genotypes = c("aA", "aA"), prob = 0.25),
                        list(genotypes = c("aA", "aa"), prob = 0.25),
                        list(genotypes = c("aa", "aA"), prob = 0.25),
                        list(genotypes = c("aa", "aa"), prob = 0.25)
       )),
  
  list(parents = c("aa", "aA"), frequency = p1*q1^3,
       sib_pairs = list(list(genotypes = c("aA", "aA"), prob = 0.25),
                        list(genotypes = c("aA", "aa"), prob = 0.25),
                        list(genotypes = c("aa", "aA"), prob = 0.25),
                        list(genotypes = c("aa", "aa"), prob = 0.25)
       )),
  
  list(parents = c("aa", "aa"), frequency = q1^4,
       sib_pairs = list(list(genotypes = c("aa", "aa"), prob = 1)))
)

遍历所有情况

先定义结果存储的变量:

# 结果保存表格
result_df <- data.frame(
  MatingType = character(),
  SibTypeK = character(),
  SibTypeL = character(),
  RecombinationType = character(),
  TotalProbability = numeric(),
  OmegaK = numeric(),
  OmegaL = numeric(),
  OmegaProduct = numeric(),
  TotalExpection = numeric(),
  stringsAsFactors = FALSE
)


total_expectation <- 0

再遍历所有情况

for (mating in mating_types) {
  mating_freq <- mating$frequency
  mating_type_str <- paste(mating$parents, collapse = "x")
  father_geno<-mating$parents[1]
  mother_geno<-mating$parents[2]
  
  father_allele1_k<-substr(father_geno,1,1)
  father_allele2_k<-substr(father_geno,2,2)
  mother_allele1_k<-substr(mother_geno,1,1)
  mother_allele2_k<-substr(mother_geno,2,2)
  
  # 遍历所有重组组合(f1, f2, m1, m2)
  recomb <- expand.grid(f1 = c(TRUE, FALSE), m1 = c(TRUE, FALSE),
                        f2 = c(TRUE, FALSE), m2 = c(TRUE, FALSE))
  recomb$prob<-NA
  
  for (i in 1:nrow(recomb)) {
    rc <- recomb[i, 1:4]
    recomb_prob <- (c^sum(unlist(rc))) * ((1 - c)^(4 - sum(unlist(rc))))
    recomb$prob[i] <- recomb_prob
  }
  
  # 每个同胞对在每种重组情形下,先枚举父母L单倍型配置,再构建该同胞对的L联合分布并累计期望
  for (sib_pair in mating$sib_pairs) {
    sib_type_str1 <- paste(sib_pair$genotypes, collapse = "_")
    sib1_k <- sib_pair$genotypes[1]
    sib2_k <- sib_pair$genotypes[2]
    sib_prob <- sib_pair$prob
    
    # K 位点 IBS 乘积
    sib1_k_score <- corrected_value(sib1_k, p1)
    sib2_k_score <- corrected_value(sib2_k, p1)
    k_product <- sib1_k_score * sib2_k_score
    
    # K 位点父/母来源字母
    s1_f_k <- substr(sib1_k, 1, 1); s1_m_k <- substr(sib1_k, 2, 2)
    s2_f_k <- substr(sib2_k, 1, 1); s2_m_k <- substr(sib2_k, 2, 2)
    
    for (i in 1:nrow(recomb)) {
      rc <- recomb[i, 1:4]
      recomb_prob <- recomb$prob[i]
      recomb_type <- sprintf(
        "Father:%s%s|Mother:%s%s",
        ifelse(rc$f1, "R", "N"), 
        ifelse(rc$f2, "R", "N"), 
        ifelse(rc$m1, "R", "N"), 
        ifelse(rc$m2, "R", "N")
      )
      
      # 父母各自两条染色体在 L 位点的B/b条件概率(仅依赖各自染色体在K位点的字母)
      father_h1_Lp <- cond_L_given_K(father_allele1_k, father_allele2_k, rc$f1)
      father_h2_Lp <- cond_L_given_K(father_allele2_k, father_allele1_k, rc$f2)
      mother_h1_Lp <- cond_L_given_K(mother_allele1_k, mother_allele2_k, rc$m1)
      mother_h2_Lp <- cond_L_given_K(mother_allele2_k, mother_allele1_k, rc$m2)
      
      # 枚举父亲/母亲的 L 单倍型配置(这一步将父母的L字母具体化,供两个同胞共享)
      father_haplo_configs <- list(
        list(l_letters = c("B","B"), prob = father_h1_Lp["B"] * father_h2_Lp["B"]),
        list(l_letters = c("B","b"), prob = father_h1_Lp["B"] * father_h2_Lp["b"]),
        list(l_letters = c("b","B"), prob = father_h1_Lp["b"] * father_h2_Lp["B"]),
        list(l_letters = c("b","b"), prob = father_h1_Lp["b"] * father_h2_Lp["b"])
      )
      mother_haplo_configs <- list(
        list(l_letters = c("B","B"), prob = mother_h1_Lp["B"] * mother_h2_Lp["B"]),
        list(l_letters = c("B","b"), prob = mother_h1_Lp["B"] * mother_h2_Lp["b"]),
        list(l_letters = c("b","B"), prob = mother_h1_Lp["b"] * mother_h2_Lp["B"]),
        list(l_letters = c("b","b"), prob = mother_h1_Lp["b"] * mother_h2_Lp["b"])
      )
      
      # 为该同胞对 + 该重组格局构建 L 位点联合分布
      l_types <- list(
        list(genotypes = c("BB", "BB"),prob=0),
        list(genotypes = c("BB", "Bb"),prob=0),
        list(genotypes = c("BB", "bB"),prob=0),
        list(genotypes = c("BB", "bb"),prob=0),
        list(genotypes = c("Bb", "BB"),prob=0),
        list(genotypes = c("Bb", "Bb"),prob=0),
        list(genotypes = c("Bb", "bB"),prob=0),
        list(genotypes = c("Bb", "bb"),prob=0),
        list(genotypes = c("bB", "BB"),prob=0),
        list(genotypes = c("bB", "Bb"),prob=0),
        list(genotypes = c("bB", "bB"),prob=0),
        list(genotypes = c("bB", "bb"),prob=0),
        list(genotypes = c("bb", "BB"),prob=0),
        list(genotypes = c("bb", "Bb"),prob=0),
        list(genotypes = c("bb", "bB"),prob=0),
        list(genotypes = c("bb", "bb"),prob=0)
      )
      
      # 通过父母 L 单倍型配置推导两个同胞的 L 联合分布
      for (father_haplo in father_haplo_configs) {
        for (mother_haplo in mother_haplo_configs) {
          config_prob <- father_haplo$prob * mother_haplo$prob
          
          # sib1:从父母各自获得的 L 位点 B/b 概率(取决于 K 来源与是否重组,以及父母两条染色体在 L 的具体状态)
          s1_f_L <- child_parent_L_probs(s1_f_k, father_allele1_k, father_allele2_k, father_haplo$l_letters[1], father_haplo$l_letters[2], rc$f1)
          s1_m_L <- child_parent_L_probs(s1_m_k, mother_allele1_k, mother_allele2_k, mother_haplo$l_letters[1], mother_haplo$l_letters[2], rc$m1)
          s1_L_dist <- list(
            BB = s1_f_L["B"] * s1_m_L["B"],
            Bb = s1_f_L["B"] * s1_m_L["b"],
            bB = s1_f_L["b"] * s1_m_L["B"],
            bb = s1_f_L["b"] * s1_m_L["b"]
          )
          
          # sib2 同理
          s2_f_L <- child_parent_L_probs(s2_f_k, father_allele1_k, father_allele2_k, father_haplo$l_letters[1], father_haplo$l_letters[2], rc$f2)
          s2_m_L <- child_parent_L_probs(s2_m_k, mother_allele1_k, mother_allele2_k, mother_haplo$l_letters[1], mother_haplo$l_letters[2], rc$m2)
          s2_L_dist <- list(
            BB = s2_f_L["B"] * s2_m_L["B"],
            Bb = s2_f_L["B"] * s2_m_L["b"],
            bB = s2_f_L["b"] * s2_m_L["B"],
            bb = s2_f_L["b"] * s2_m_L["b"]
          )
          
          # 合并成同胞对在 L 位点的联合分布,并累加到 l_types
          for (s1_l in names(s1_L_dist)) {
            for (s2_l in names(s2_L_dist)) {
              joint_prob_l <- as.numeric(s1_L_dist[[s1_l]]) * as.numeric(s2_L_dist[[s2_l]]) * config_prob
              for (idx in 1:length(l_types)) {
                if (l_types[[idx]]$genotypes[1] == s1_l && l_types[[idx]]$genotypes[2] == s2_l) {
                  l_types[[idx]]$prob <- l_types[[idx]]$prob + joint_prob_l
                }
              }
            }
          }
        }
      }
      
      # 用该同胞对的 l_types 来累加期望,并记录到结果表
      for (l_type in l_types) {
        sib_type_str2 <- paste(l_type$genotypes, collapse = "_")
        joint_prob_l <- l_type$prob
        
        # 计算 l 位点 IBS 得分乘积
        s1_l <- l_type$genotypes[1]
        s2_l <- l_type$genotypes[2]
        s1_l_score <- corrected_value(s1_l, p2)
        s2_l_score <- corrected_value(s2_l, p2)
        l_product <- s1_l_score * s2_l_score
        
        # 计算总概率和总乘积
        total_prob <- mating_freq * sib_prob * recomb_prob * joint_prob_l
        total_product <- k_product * l_product
        
        # 累加到总期望,并记录到结果表
        total_expectation <- total_expectation + total_product * total_prob
        
        new_row <- data.frame(
          MatingType = mating_type_str,
          SibTypeK = sib_type_str1,
          SibTypeL = sib_type_str2,
          RecombinationType = recomb_type,
          TotalProbability = total_prob,
          OmegaK = k_product,
          OmegaL = l_product,
          OmegaProduct = total_product,
          TotalExpection = total_product * total_prob
        )
        result_df <- rbind(result_df, new_row)
      }
    }
  }
}

展示结果:

cat("期望值 E(Ωk Ωl) =", total_expectation, "\n")
## 期望值 E(Ωk Ωl) = 1.25

抽样模拟验证

我们使用模拟出不同的单倍型然后进行抽样的方式来检验我们的numerical simulation是否符合预期。

参数初始化

保持与numerical模拟相同即可

set.seed(123)
p1 <- 0.5 # k位点A等位基因频率
q1 <- 1 - p1 # k位点a等位基因频率
p2 <- 0.5 # l位点B等位基因频率
q2 <- 1 - p2 # l位点b等位基因频率
c <- 0
D <- 0.25

之后,我们可以基于等位基因频率计算出四种单倍型的频率,为后续产生配子库做准备

# 单体型频率计算
haplo_freq <- c(p1*p2 + D, p1*q2 - D, q1*p2 - D, q1*q2 + D)
names(haplo_freq) <- c("AB", "Ab", "aB", "ab")

# 单体型定义数据框
haplo_df <- data.frame(
  haplo = names(haplo_freq),
  k = substr(names(haplo_freq), 1, 1),
  l = substr(names(haplo_freq), 2, 2),
  prob = haplo_freq
)

定义辅助函数

# 校正基因型值函数
corrected_value <- function(genotype, p) {
  if (genotype %in% c("AA", "BB")) {
    (2 - 2*p) / sqrt(2 * p * (1 - p))
  } else if (genotype %in% c("Aa", "Bb")) {
    (1 - 2*p) / sqrt(2 * p * (1 - p))
  } else if (genotype %in% c("aa", "bb")) {
    (0 - 2*p) / sqrt(2 * p * (1 - p))
  } else {
    stop("无效基因型")
  }
}

# 配子生成函数:从亲本的单倍型中按照重组率抽取单倍型
generate_gamete <- function(parent_haplos, recomb) {
  is_recomb <- runif(1) < recomb # 判断是否重组
  if (is_recomb) { # 如果重组
    if(runif(1)<0.5){
      gamete <- c(
        k = parent_haplos$k[1],
        l = parent_haplos$l[2],
        recomb = TRUE
      )
    }else{
      gamete <- c(
        k = parent_haplos$k[2],
        l = parent_haplos$l[1],
        recomb = TRUE
      )
    }
  } else {
    idx <- sample(1:2, 1)
    gamete <- c(
      k = parent_haplos$k[idx],
      l = parent_haplos$l[idx],
      recomb = FALSE
    )
  }
  return(as.list(gamete))
}

# 基因型格式化
format_geno <- function(alleles) {
  sorted <- rev(sort(alleles))
  paste0(sorted, collapse = "")
}

对各个家庭的模拟

我们模拟1000个家庭,先定义一个模拟家庭的函数:

simulate_family <- function(father, mother, recomb) {
  # 生成Sib1配子
  f_gamete1 <- generate_gamete(father, recomb)
  m_gamete1 <- generate_gamete(mother, recomb)
  
  # 生产Sib2配子
  f_gamete2 <- generate_gamete(father, recomb)
  m_gamete2 <- generate_gamete(mother, recomb)
  
  # 生成子代基因型
  sib1_k <- format_geno(c(f_gamete1$k, m_gamete1$k))
  sib1_l <- format_geno(c(f_gamete1$l, m_gamete1$l))
  sib2_k <- format_geno(c(f_gamete2$k, m_gamete2$k))
  sib2_l <- format_geno(c(f_gamete2$l, m_gamete2$l))
  
  # 确定重组类型
  recomb_type <- paste0(
    "S1:",
    ifelse(f_gamete1$recomb,"R","N"),
    ifelse(m_gamete1$recomb,"R","N"),
    "|",
    "S2:",
    ifelse(f_gamete2$recomb,"R","N"),
    ifelse(m_gamete2$recomb,"R","N")
  )
  
  # 计算Omega值
  omega_k <- corrected_value(sib1_k, p1) * corrected_value(sib2_k, p1)
  omega_l <- corrected_value(sib1_l, p2) * corrected_value(sib2_l, p2)
  
  # Mating type
  mating_type <- paste(format_geno(c(father$k[1], father$k[2])),
                       format_geno(c(mother$k[1], mother$k[2])),
                       sep="x"
  )
  
  return(data.frame(
    MatingType = mating_type,
    SibTypeK = paste(c(sib1_k, sib2_k), collapse = "_"),
    SibTypeL = paste(c(sib1_l, sib2_l), collapse = "_"),
    RecombinationType = recomb_type,
    OmegaK = omega_k,
    OmegaL = omega_l,
    OmegaProduct = omega_k * omega_l
  ))
}

然后再模拟每个家庭,将所有结果整合起来

# 批量模拟1000家庭
n_families<-1000
results <- vector("list", n_families)
for(f in 1:n_families){
  # 生成父母单体型
  father <- haplo_df[sample(1:4, 2, replace = TRUE, prob = haplo_freq), ]
  mother <- haplo_df[sample(1:4, 2, replace = TRUE, prob = haplo_freq), ]
  
  # 模拟家庭
  res <- simulate_family(father, mother, c)
  results[[f]] <- res
}

results<-rbindlist(results)
results_summary <- results %>%
  group_by(MatingType, SibTypeK, SibTypeL, RecombinationType,
           OmegaK, OmegaL, OmegaProduct) %>%
  summarise(
    TotalProbability = dplyr::n() / n_families,
    .groups = 'drop'
  ) %>%
  mutate(
    TotalExpection = OmegaProduct * TotalProbability
  )

cat("期望值 E(Ωk Ωl) =", sum(results_summary$TotalExpection), "\n")
## 期望值 E(Ωk Ωl) = 1.124

可以看出这个数值与理论值还是比较接近的

十位点模拟

为了进一步验证我们模拟的准确性,我们做了10个位点两两之间的理论模拟与实际模拟,共10000个家庭,看下是否偏差依然存在

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

可以看出拟合得非常完美。