研究一对sib在两个位点\(k\)和\(l\)的IBS得分乘积的期望,具体过程如下:
先看第一个位点k,其三种基因型为AA,Aa,aa,其中A等位基因的频率为p1,a等位基因的频率为q1,我们用下标p表示父亲,下标m表示母亲,我们可以获得9种交配类型,其分布律为:
根据不同的Mating type,我们可以获得这些交配模式下的Sib pair的基因型及其概率。
我们单独取出其中一种Mating type来看为\(A_pA_p\times A_mA_m\),此时,第一个位点k的sib pair只有一种类型,即\(\{A_p A_m,A_p A_m\}\)。对于亲本第二个位点l,我们假设其有两个等位基因B和b,等位基因频率分别为p2和q2。亲本向下传递配子时,第二个位点的频率分布取决于两个位点之间的重组率\(c\)与连锁不平衡程度\(D\)。当一条单倍型不重组时,第二个位点的等位基因频率分布取决于同一条单倍型上k位点的等位基因,即\(P(第二个位点等位基因|同一条单倍型上第一个位点的等位基因)\)服从如下分布:
当重组发生时,第二个位点的等位基因频率分布取决于另一条单倍型上k位点的等位基因,即\(P(第二个位点等位基因|另一条单倍型上第一个位点的等位基因)\),该条件概率同样服从上表。
这样我们就可以给出亲本在四条单倍型上l位点的分布,由此分布我们就可以给出父亲和目前在第l位点基因型(BB、Bb、bB、bb)的分布律(概率等于上面条件概率两两组合的乘积)。
之后,我们就可以由亲本抽样产生配子进而组合形成sib pair。由于k位点的sib pair的基因型是确定的,我们可以基于每个sib在k位点的等位基因状态来决定每个sib在l位点的分布情况。
比如,一个sib在k位点的父本allele为\(A_p\),由于此时父亲在k位点的基因型为\(A_pA_p\),由于基因型纯合,所以该allele继承自两条haplotype的概率为50/50(无论是否重组都是这样的状态),此时在该haplotype上l位点的allele为B的概率为\(p_B=0.5\times p(\text{父亲第一条haplotype的第二个位点为B}) + 0.5\times p(\text{父亲第二条haplotype的第二个位点为B}))\),而为b的概率为\(1-p_B\)。
而假设父亲在k位点的基因型为杂合(\(A_pa_p\)或者\(a_pA_p\))时,则一旦sib在k位点的allele确定,也就确定了它是来自亲本哪条haplotype,此时如果不发生重组,则sib在l位点就取决于亲本该条haplotype在l位点上的allele(B或者b),如果发生重组,则sib在l位点就取决于亲本另一条haplotype在l位点上的allele(B或者b)。
通过这种方式,我们就可以进而确定该sib在l位点的基因型(BB、Bb、bB、bb)的分布律(概率等于上面对应概率两两组合的乘积)。同样地,也可以决定另一个sib在l位点的不同基因型的分布律。
之后我们可以依次遍历两个sib在l位点的基因型组合,并计算出每种组合的概率。
确定了两个位点的基因型,我们可以根据两个Sib在第一个位点和第二个位点的同态一致性(IBS1与IBS2)得分,并计算其乘积,对于一个个体的第一个位点(\(k\))来说,其三种基因型校正后的基因型值如下表所示:
类似地,对于一个个体的第二个位点(\(l\))来说,其三种基因型校正后的基因型值如下表所示:
结合上面2个表,我们就可以计算出两个Sib1与Sib2之间在两个位点上IBS得分的乘积。
整体计算流程(伪代码)如下:
下面举一个例子来演示整个过程,假设两个位点完全连锁,重组率为0,此时两位点IBS得分乘积的期望理论上等于1.25。
两个位点,重组率0,连锁不平衡程度\(D=0.25\),两个位点的等位基因频率为\(p1=p2=0.5\)
p1 <- 0.5 # k位点A等位基因频率
q1 <- 1 - p1 # k位点a等位基因频率
p2 <- 0.5 # l位点B等位基因频率
q2 <- 1 - p2 # l位点b等位基因频率
c <- 0
D <- 0.25 # 这样两位点就完全独立
# 校正后的基因型值计算函数(保留原逻辑)
corrected_value <- function(genotype, p) {
if (genotype == "AA" | genotype == "BB") {
(2 - 2*p) / sqrt(2 * p * (1 - p))
} else if (genotype == "Aa" | genotype == "Bb" | genotype == "aA" | genotype == "bB") {
(1 - 2*p) / sqrt(2 * p * (1 - p))
} else if (genotype == "aa" | genotype == "bb") {
(0 - 2*p) / sqrt(2 * p * (1 - p))
} else {
stop("Invalid genotype")
}
}
# 位点l的条件概率函数
cond_L_given_K <- function(k_allele1,k_allele2,is_recombination) {
if (is_recombination) {
allele<-k_allele2
if (allele == "A") {
B_prob <- p2 + D / p1
b_prob <- q2 - D / p1
} else if (allele == "a") {
B_prob <- p2 - D / q1
b_prob <- q2 + D / q1
} else {
stop("Invalid allele")
}
return(c(B = B_prob, b = b_prob))
} else {
allele<-k_allele1
if (allele == "A") {
B_prob <- p2 + D / p1
b_prob <- q2 - D / p1
} else if (allele == "a") {
B_prob <- p2 - D / q1
b_prob <- q2 + D / q1
} else {
stop("Invalid allele")
}
return(c(B = B_prob, b = b_prob))
}
}
# 在给定亲本两条染色体的 K/L 等位基因与是否重组时,sib从该家长在 L 位点的 B/b 概率
# - 若亲本在 K 位点纯合(AA 或 aa),孩子在K位点选到哪条染色体是 50/50;是否重组对 L 的来源等价于另一条染色体,整体仍为两条的均匀混合;
# - 若亲本在 K 位点杂合(Aa / aA),孩子在 K 位点的字母唯一对应某一条染色体;是否重组决定 L 来自同一条或另一条染色体。
child_parent_L_probs <- function(chosen_k_letter, parent_k_h1, parent_k_h2, parent_L_h1, parent_L_h2, recomb_flag) {
if (parent_k_h1 == parent_k_h2) {
# 纯合:孩子在K位点选到哪条染色体是50/50;重组与否都导致L来源两条均匀混合
pB <- 0.5 * (parent_L_h1 == "B") + 0.5 * (parent_L_h2 == "B")
return(c(B = pB, b = 1 - pB))
}
# 杂合:孩子在K位点的字母唯一对应某一条染色体
source_idx <- if (chosen_k_letter == parent_k_h1) 1 else if (chosen_k_letter == parent_k_h2) 2 else stop("Chosen K not in parent haplotypes")
L_idx <- if (recomb_flag) (3 - source_idx) else source_idx
L_letter <- if (L_idx == 1) parent_L_h1 else parent_L_h2
if (L_letter == "B") c(B = 1, b = 0) else c(B = 0, b = 1)
}
mating_types <- list(
list(parents = c("AA", "AA"), frequency = p1^4,
sib_pairs = list(list(genotypes = c("AA", "AA"), prob = 1))),
list(parents = c("AA", "Aa"), frequency = p1^3*q1,
sib_pairs = list(list(genotypes = c("AA", "AA"), prob = 0.25),
list(genotypes = c("AA", "Aa"), prob = 0.25),
list(genotypes = c("Aa", "AA"), prob = 0.25),
list(genotypes = c("Aa", "Aa"), prob = 0.25)
)),
list(parents = c("AA", "aA"), frequency = p1^3*q1,
sib_pairs = list(list(genotypes = c("AA", "AA"), prob = 0.25),
list(genotypes = c("AA", "Aa"), prob = 0.25),
list(genotypes = c("Aa", "AA"), prob = 0.25),
list(genotypes = c("Aa", "Aa"), prob = 0.25)
)),
list(parents = c("AA", "aa"), frequency = p1^2*q1^2,
sib_pairs = list(list(genotypes = c("Aa", "Aa"), prob = 1)
)),
list(parents = c("Aa", "AA"), frequency = p1^3*q1,
sib_pairs = list(list(genotypes = c("AA", "AA"), prob = 0.25),
list(genotypes = c("AA", "aA"), prob = 0.25),
list(genotypes = c("aA", "AA"), prob = 0.25),
list(genotypes = c("aA", "aA"), prob = 0.25)
)),
list(parents = c("Aa", "Aa"), frequency = p1^2*q1^2,
sib_pairs = list(
list(genotypes = c("AA", "AA"), prob = 0.0625),
list(genotypes = c("AA", "Aa"), prob = 0.0625),
list(genotypes = c("AA", "aA"), prob = 0.0625),
list(genotypes = c("AA", "aa"), prob = 0.0625),
list(genotypes = c("Aa", "AA"), prob = 0.0625),
list(genotypes = c("Aa", "Aa"), prob = 0.0625),
list(genotypes = c("Aa", "aA"), prob = 0.0625),
list(genotypes = c("Aa", "aa"), prob = 0.0625),
list(genotypes = c("aA", "AA"), prob = 0.0625),
list(genotypes = c("aA", "Aa"), prob = 0.0625),
list(genotypes = c("aA", "aA"), prob = 0.0625),
list(genotypes = c("aA", "aa"), prob = 0.0625),
list(genotypes = c("aa", "AA"), prob = 0.0625),
list(genotypes = c("aa", "Aa"), prob = 0.0625),
list(genotypes = c("aa", "aA"), prob = 0.0625),
list(genotypes = c("aa", "aa"), prob = 0.0625)
)),
list(parents = c("Aa", "aA"), frequency = p1^2*q1^2,
sib_pairs = list(
list(genotypes = c("AA", "AA"), prob = 0.0625),
list(genotypes = c("AA", "Aa"), prob = 0.0625),
list(genotypes = c("AA", "aA"), prob = 0.0625),
list(genotypes = c("AA", "aa"), prob = 0.0625),
list(genotypes = c("Aa", "AA"), prob = 0.0625),
list(genotypes = c("Aa", "Aa"), prob = 0.0625),
list(genotypes = c("Aa", "aA"), prob = 0.0625),
list(genotypes = c("Aa", "aa"), prob = 0.0625),
list(genotypes = c("aA", "AA"), prob = 0.0625),
list(genotypes = c("aA", "Aa"), prob = 0.0625),
list(genotypes = c("aA", "aA"), prob = 0.0625),
list(genotypes = c("aA", "aa"), prob = 0.0625),
list(genotypes = c("aa", "AA"), prob = 0.0625),
list(genotypes = c("aa", "Aa"), prob = 0.0625),
list(genotypes = c("aa", "aA"), prob = 0.0625),
list(genotypes = c("aa", "aa"), prob = 0.0625)
)),
list(parents = c("Aa", "aa"), frequency = p1*q1^3,
sib_pairs = list(list(genotypes = c("Aa", "Aa"), prob = 0.25),
list(genotypes = c("Aa", "aa"), prob = 0.25),
list(genotypes = c("aa", "Aa"), prob = 0.25),
list(genotypes = c("aa", "aa"), prob = 0.25)
)),
list(parents = c("aA", "AA"), frequency = p1^3*q1,
sib_pairs = list(list(genotypes = c("AA", "AA"), prob = 0.25),
list(genotypes = c("AA", "aA"), prob = 0.25),
list(genotypes = c("aA", "AA"), prob = 0.25),
list(genotypes = c("aA", "aA"), prob = 0.25)
)),
list(parents = c("aA", "Aa"), frequency = p1^2*q1^2,
sib_pairs = list(
list(genotypes = c("AA", "AA"), prob = 0.0625),
list(genotypes = c("AA", "Aa"), prob = 0.0625),
list(genotypes = c("AA", "aA"), prob = 0.0625),
list(genotypes = c("AA", "aa"), prob = 0.0625),
list(genotypes = c("Aa", "AA"), prob = 0.0625),
list(genotypes = c("Aa", "Aa"), prob = 0.0625),
list(genotypes = c("Aa", "aA"), prob = 0.0625),
list(genotypes = c("Aa", "aa"), prob = 0.0625),
list(genotypes = c("aA", "AA"), prob = 0.0625),
list(genotypes = c("aA", "Aa"), prob = 0.0625),
list(genotypes = c("aA", "aA"), prob = 0.0625),
list(genotypes = c("aA", "aa"), prob = 0.0625),
list(genotypes = c("aa", "AA"), prob = 0.0625),
list(genotypes = c("aa", "Aa"), prob = 0.0625),
list(genotypes = c("aa", "aA"), prob = 0.0625),
list(genotypes = c("aa", "aa"), prob = 0.0625)
)),
list(parents = c("aA", "aA"), frequency = p1^2*q1^2,
sib_pairs = list(
list(genotypes = c("AA", "AA"), prob = 0.0625),
list(genotypes = c("AA", "Aa"), prob = 0.0625),
list(genotypes = c("AA", "aA"), prob = 0.0625),
list(genotypes = c("AA", "aa"), prob = 0.0625),
list(genotypes = c("Aa", "AA"), prob = 0.0625),
list(genotypes = c("Aa", "Aa"), prob = 0.0625),
list(genotypes = c("Aa", "aA"), prob = 0.0625),
list(genotypes = c("Aa", "aa"), prob = 0.0625),
list(genotypes = c("aA", "AA"), prob = 0.0625),
list(genotypes = c("aA", "Aa"), prob = 0.0625),
list(genotypes = c("aA", "aA"), prob = 0.0625),
list(genotypes = c("aA", "aa"), prob = 0.0625),
list(genotypes = c("aa", "AA"), prob = 0.0625),
list(genotypes = c("aa", "Aa"), prob = 0.0625),
list(genotypes = c("aa", "aA"), prob = 0.0625),
list(genotypes = c("aa", "aa"), prob = 0.0625)
)),
list(parents = c("aA", "aa"), frequency = p1*q1^3,
sib_pairs = list(list(genotypes = c("Aa", "Aa"), prob = 0.25),
list(genotypes = c("Aa", "aa"), prob = 0.25),
list(genotypes = c("aa", "Aa"), prob = 0.25),
list(genotypes = c("aa", "aa"), prob = 0.25)
)),
list(parents = c("aa", "AA"), frequency = p1^2*q1^2,
sib_pairs = list(list(genotypes = c("aA", "aA"), prob = 1))),
list(parents = c("aa", "Aa"), frequency = p1*q1^3,
sib_pairs = list(list(genotypes = c("aA", "aA"), prob = 0.25),
list(genotypes = c("aA", "aa"), prob = 0.25),
list(genotypes = c("aa", "aA"), prob = 0.25),
list(genotypes = c("aa", "aa"), prob = 0.25)
)),
list(parents = c("aa", "aA"), frequency = p1*q1^3,
sib_pairs = list(list(genotypes = c("aA", "aA"), prob = 0.25),
list(genotypes = c("aA", "aa"), prob = 0.25),
list(genotypes = c("aa", "aA"), prob = 0.25),
list(genotypes = c("aa", "aa"), prob = 0.25)
)),
list(parents = c("aa", "aa"), frequency = q1^4,
sib_pairs = list(list(genotypes = c("aa", "aa"), prob = 1)))
)
先定义结果存储的变量:
# 结果保存表格
result_df <- data.frame(
MatingType = character(),
SibTypeK = character(),
SibTypeL = character(),
RecombinationType = character(),
TotalProbability = numeric(),
OmegaK = numeric(),
OmegaL = numeric(),
OmegaProduct = numeric(),
TotalExpection = numeric(),
stringsAsFactors = FALSE
)
total_expectation <- 0
再遍历所有情况
for (mating in mating_types) {
mating_freq <- mating$frequency
mating_type_str <- paste(mating$parents, collapse = "x")
father_geno<-mating$parents[1]
mother_geno<-mating$parents[2]
father_allele1_k<-substr(father_geno,1,1)
father_allele2_k<-substr(father_geno,2,2)
mother_allele1_k<-substr(mother_geno,1,1)
mother_allele2_k<-substr(mother_geno,2,2)
# 遍历所有重组组合(f1, f2, m1, m2)
recomb <- expand.grid(f1 = c(TRUE, FALSE), m1 = c(TRUE, FALSE),
f2 = c(TRUE, FALSE), m2 = c(TRUE, FALSE))
recomb$prob<-NA
for (i in 1:nrow(recomb)) {
rc <- recomb[i, 1:4]
recomb_prob <- (c^sum(unlist(rc))) * ((1 - c)^(4 - sum(unlist(rc))))
recomb$prob[i] <- recomb_prob
}
# 每个同胞对在每种重组情形下,先枚举父母L单倍型配置,再构建该同胞对的L联合分布并累计期望
for (sib_pair in mating$sib_pairs) {
sib_type_str1 <- paste(sib_pair$genotypes, collapse = "_")
sib1_k <- sib_pair$genotypes[1]
sib2_k <- sib_pair$genotypes[2]
sib_prob <- sib_pair$prob
# K 位点 IBS 乘积
sib1_k_score <- corrected_value(sib1_k, p1)
sib2_k_score <- corrected_value(sib2_k, p1)
k_product <- sib1_k_score * sib2_k_score
# K 位点父/母来源字母
s1_f_k <- substr(sib1_k, 1, 1); s1_m_k <- substr(sib1_k, 2, 2)
s2_f_k <- substr(sib2_k, 1, 1); s2_m_k <- substr(sib2_k, 2, 2)
for (i in 1:nrow(recomb)) {
rc <- recomb[i, 1:4]
recomb_prob <- recomb$prob[i]
recomb_type <- sprintf(
"Father:%s%s|Mother:%s%s",
ifelse(rc$f1, "R", "N"),
ifelse(rc$f2, "R", "N"),
ifelse(rc$m1, "R", "N"),
ifelse(rc$m2, "R", "N")
)
# 父母各自两条染色体在 L 位点的B/b条件概率(仅依赖各自染色体在K位点的字母)
father_h1_Lp <- cond_L_given_K(father_allele1_k, father_allele2_k, rc$f1)
father_h2_Lp <- cond_L_given_K(father_allele2_k, father_allele1_k, rc$f2)
mother_h1_Lp <- cond_L_given_K(mother_allele1_k, mother_allele2_k, rc$m1)
mother_h2_Lp <- cond_L_given_K(mother_allele2_k, mother_allele1_k, rc$m2)
# 枚举父亲/母亲的 L 单倍型配置(这一步将父母的L字母具体化,供两个同胞共享)
father_haplo_configs <- list(
list(l_letters = c("B","B"), prob = father_h1_Lp["B"] * father_h2_Lp["B"]),
list(l_letters = c("B","b"), prob = father_h1_Lp["B"] * father_h2_Lp["b"]),
list(l_letters = c("b","B"), prob = father_h1_Lp["b"] * father_h2_Lp["B"]),
list(l_letters = c("b","b"), prob = father_h1_Lp["b"] * father_h2_Lp["b"])
)
mother_haplo_configs <- list(
list(l_letters = c("B","B"), prob = mother_h1_Lp["B"] * mother_h2_Lp["B"]),
list(l_letters = c("B","b"), prob = mother_h1_Lp["B"] * mother_h2_Lp["b"]),
list(l_letters = c("b","B"), prob = mother_h1_Lp["b"] * mother_h2_Lp["B"]),
list(l_letters = c("b","b"), prob = mother_h1_Lp["b"] * mother_h2_Lp["b"])
)
# 为该同胞对 + 该重组格局构建 L 位点联合分布
l_types <- list(
list(genotypes = c("BB", "BB"),prob=0),
list(genotypes = c("BB", "Bb"),prob=0),
list(genotypes = c("BB", "bB"),prob=0),
list(genotypes = c("BB", "bb"),prob=0),
list(genotypes = c("Bb", "BB"),prob=0),
list(genotypes = c("Bb", "Bb"),prob=0),
list(genotypes = c("Bb", "bB"),prob=0),
list(genotypes = c("Bb", "bb"),prob=0),
list(genotypes = c("bB", "BB"),prob=0),
list(genotypes = c("bB", "Bb"),prob=0),
list(genotypes = c("bB", "bB"),prob=0),
list(genotypes = c("bB", "bb"),prob=0),
list(genotypes = c("bb", "BB"),prob=0),
list(genotypes = c("bb", "Bb"),prob=0),
list(genotypes = c("bb", "bB"),prob=0),
list(genotypes = c("bb", "bb"),prob=0)
)
# 通过父母 L 单倍型配置推导两个同胞的 L 联合分布
for (father_haplo in father_haplo_configs) {
for (mother_haplo in mother_haplo_configs) {
config_prob <- father_haplo$prob * mother_haplo$prob
# sib1:从父母各自获得的 L 位点 B/b 概率(取决于 K 来源与是否重组,以及父母两条染色体在 L 的具体状态)
s1_f_L <- child_parent_L_probs(s1_f_k, father_allele1_k, father_allele2_k, father_haplo$l_letters[1], father_haplo$l_letters[2], rc$f1)
s1_m_L <- child_parent_L_probs(s1_m_k, mother_allele1_k, mother_allele2_k, mother_haplo$l_letters[1], mother_haplo$l_letters[2], rc$m1)
s1_L_dist <- list(
BB = s1_f_L["B"] * s1_m_L["B"],
Bb = s1_f_L["B"] * s1_m_L["b"],
bB = s1_f_L["b"] * s1_m_L["B"],
bb = s1_f_L["b"] * s1_m_L["b"]
)
# sib2 同理
s2_f_L <- child_parent_L_probs(s2_f_k, father_allele1_k, father_allele2_k, father_haplo$l_letters[1], father_haplo$l_letters[2], rc$f2)
s2_m_L <- child_parent_L_probs(s2_m_k, mother_allele1_k, mother_allele2_k, mother_haplo$l_letters[1], mother_haplo$l_letters[2], rc$m2)
s2_L_dist <- list(
BB = s2_f_L["B"] * s2_m_L["B"],
Bb = s2_f_L["B"] * s2_m_L["b"],
bB = s2_f_L["b"] * s2_m_L["B"],
bb = s2_f_L["b"] * s2_m_L["b"]
)
# 合并成同胞对在 L 位点的联合分布,并累加到 l_types
for (s1_l in names(s1_L_dist)) {
for (s2_l in names(s2_L_dist)) {
joint_prob_l <- as.numeric(s1_L_dist[[s1_l]]) * as.numeric(s2_L_dist[[s2_l]]) * config_prob
for (idx in 1:length(l_types)) {
if (l_types[[idx]]$genotypes[1] == s1_l && l_types[[idx]]$genotypes[2] == s2_l) {
l_types[[idx]]$prob <- l_types[[idx]]$prob + joint_prob_l
}
}
}
}
}
}
# 用该同胞对的 l_types 来累加期望,并记录到结果表
for (l_type in l_types) {
sib_type_str2 <- paste(l_type$genotypes, collapse = "_")
joint_prob_l <- l_type$prob
# 计算 l 位点 IBS 得分乘积
s1_l <- l_type$genotypes[1]
s2_l <- l_type$genotypes[2]
s1_l_score <- corrected_value(s1_l, p2)
s2_l_score <- corrected_value(s2_l, p2)
l_product <- s1_l_score * s2_l_score
# 计算总概率和总乘积
total_prob <- mating_freq * sib_prob * recomb_prob * joint_prob_l
total_product <- k_product * l_product
# 累加到总期望,并记录到结果表
total_expectation <- total_expectation + total_product * total_prob
new_row <- data.frame(
MatingType = mating_type_str,
SibTypeK = sib_type_str1,
SibTypeL = sib_type_str2,
RecombinationType = recomb_type,
TotalProbability = total_prob,
OmegaK = k_product,
OmegaL = l_product,
OmegaProduct = total_product,
TotalExpection = total_product * total_prob
)
result_df <- rbind(result_df, new_row)
}
}
}
}
展示结果:
cat("期望值 E(Ωk Ωl) =", total_expectation, "\n")
## 期望值 E(Ωk Ωl) = 1.25
我们使用模拟出不同的单倍型然后进行抽样的方式来检验我们的numerical simulation是否符合预期。
保持与numerical模拟相同即可
set.seed(123)
p1 <- 0.5 # k位点A等位基因频率
q1 <- 1 - p1 # k位点a等位基因频率
p2 <- 0.5 # l位点B等位基因频率
q2 <- 1 - p2 # l位点b等位基因频率
c <- 0
D <- 0.25
之后,我们可以基于等位基因频率计算出四种单倍型的频率,为后续产生配子库做准备
# 单体型频率计算
haplo_freq <- c(p1*p2 + D, p1*q2 - D, q1*p2 - D, q1*q2 + D)
names(haplo_freq) <- c("AB", "Ab", "aB", "ab")
# 单体型定义数据框
haplo_df <- data.frame(
haplo = names(haplo_freq),
k = substr(names(haplo_freq), 1, 1),
l = substr(names(haplo_freq), 2, 2),
prob = haplo_freq
)
# 校正基因型值函数
corrected_value <- function(genotype, p) {
if (genotype %in% c("AA", "BB")) {
(2 - 2*p) / sqrt(2 * p * (1 - p))
} else if (genotype %in% c("Aa", "Bb")) {
(1 - 2*p) / sqrt(2 * p * (1 - p))
} else if (genotype %in% c("aa", "bb")) {
(0 - 2*p) / sqrt(2 * p * (1 - p))
} else {
stop("无效基因型")
}
}
# 配子生成函数:从亲本的单倍型中按照重组率抽取单倍型
generate_gamete <- function(parent_haplos, recomb) {
is_recomb <- runif(1) < recomb # 判断是否重组
if (is_recomb) { # 如果重组
if(runif(1)<0.5){
gamete <- c(
k = parent_haplos$k[1],
l = parent_haplos$l[2],
recomb = TRUE
)
}else{
gamete <- c(
k = parent_haplos$k[2],
l = parent_haplos$l[1],
recomb = TRUE
)
}
} else {
idx <- sample(1:2, 1)
gamete <- c(
k = parent_haplos$k[idx],
l = parent_haplos$l[idx],
recomb = FALSE
)
}
return(as.list(gamete))
}
# 基因型格式化
format_geno <- function(alleles) {
sorted <- rev(sort(alleles))
paste0(sorted, collapse = "")
}
我们模拟1000个家庭,先定义一个模拟家庭的函数:
simulate_family <- function(father, mother, recomb) {
# 生成Sib1配子
f_gamete1 <- generate_gamete(father, recomb)
m_gamete1 <- generate_gamete(mother, recomb)
# 生产Sib2配子
f_gamete2 <- generate_gamete(father, recomb)
m_gamete2 <- generate_gamete(mother, recomb)
# 生成子代基因型
sib1_k <- format_geno(c(f_gamete1$k, m_gamete1$k))
sib1_l <- format_geno(c(f_gamete1$l, m_gamete1$l))
sib2_k <- format_geno(c(f_gamete2$k, m_gamete2$k))
sib2_l <- format_geno(c(f_gamete2$l, m_gamete2$l))
# 确定重组类型
recomb_type <- paste0(
"S1:",
ifelse(f_gamete1$recomb,"R","N"),
ifelse(m_gamete1$recomb,"R","N"),
"|",
"S2:",
ifelse(f_gamete2$recomb,"R","N"),
ifelse(m_gamete2$recomb,"R","N")
)
# 计算Omega值
omega_k <- corrected_value(sib1_k, p1) * corrected_value(sib2_k, p1)
omega_l <- corrected_value(sib1_l, p2) * corrected_value(sib2_l, p2)
# Mating type
mating_type <- paste(format_geno(c(father$k[1], father$k[2])),
format_geno(c(mother$k[1], mother$k[2])),
sep="x"
)
return(data.frame(
MatingType = mating_type,
SibTypeK = paste(c(sib1_k, sib2_k), collapse = "_"),
SibTypeL = paste(c(sib1_l, sib2_l), collapse = "_"),
RecombinationType = recomb_type,
OmegaK = omega_k,
OmegaL = omega_l,
OmegaProduct = omega_k * omega_l
))
}
然后再模拟每个家庭,将所有结果整合起来
# 批量模拟1000家庭
n_families<-1000
results <- vector("list", n_families)
for(f in 1:n_families){
# 生成父母单体型
father <- haplo_df[sample(1:4, 2, replace = TRUE, prob = haplo_freq), ]
mother <- haplo_df[sample(1:4, 2, replace = TRUE, prob = haplo_freq), ]
# 模拟家庭
res <- simulate_family(father, mother, c)
results[[f]] <- res
}
results<-rbindlist(results)
results_summary <- results %>%
group_by(MatingType, SibTypeK, SibTypeL, RecombinationType,
OmegaK, OmegaL, OmegaProduct) %>%
summarise(
TotalProbability = dplyr::n() / n_families,
.groups = 'drop'
) %>%
mutate(
TotalExpection = OmegaProduct * TotalProbability
)
cat("期望值 E(Ωk Ωl) =", sum(results_summary$TotalExpection), "\n")
## 期望值 E(Ωk Ωl) = 1.124
可以看出这个数值与理论值还是比较接近的
为了进一步验证我们模拟的准确性,我们做了10个位点两两之间的理论模拟与实际模拟,共10000个家庭,看下是否偏差依然存在
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
可以看出拟合得非常完美。