Sampling an Individual from a General Population with 2 Subpopulations

Suppose a general population splits into two subpopulations, which I will call Mac and Don, making up proportions s and 1-s of the population, respectively. Observe that s, 1-s ∈ (0,1).

Mac and Don are each in Hardy-Weinberg equilibrium (HWE), meaning there is random mating, no inbreeding, infinite population size, discrete generations, equal allele frequencies in males and females, and no mutation, migration, or selection. Their allele frequencies are \(p_1\) and \(p_2\), respectively. We are interested in sampling 100 individuals from this population to compare the observed genotype frequencies to those expected under HWE at the overall population allele frequency \(\hat{p}\).

For each sample, we need to set random values of s, \(p_1\), and \(p_2\) that fall within [0,1]. Then, we will randomly sample 100 individuals from the population and determine the frequency of allele A. For a sample of n individuals, the estimate is \(\hat{p} = \frac{2n_{AA} + n_{Aa}}{2n}\). We can then use \(\hat{p}\) to generate expected proportions for genotypes AA, Aa, and aa: \(\hat{p}^2\), \(2\hat{p}(1-\hat{p})\), and \((1-\hat{p})^2\).
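
The estimator and the expected HWE proportions above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used to produce the results; the function names and the genotype counts (12 AA, 42 Aa, 46 aa out of 100) are assumptions chosen for the example.

```python
def allele_frequency(n_AA, n_Aa, n_aa):
    """Estimate the frequency of allele A from genotype counts."""
    n = n_AA + n_Aa + n_aa
    return (2 * n_AA + n_Aa) / (2 * n)


def hwe_expected(p_hat):
    """Expected genotype proportions (AA, Aa, aa) under HWE."""
    q_hat = 1 - p_hat
    return {"AA": p_hat ** 2, "Aa": 2 * p_hat * q_hat, "aa": q_hat ** 2}


# Illustrative sample of 100 individuals (counts are assumed for the example).
p_hat = allele_frequency(12, 42, 46)
expected = hwe_expected(p_hat)
print(round(p_hat, 3))                                # 0.33
print({g: round(f, 3) for g, f in expected.items()})  # AA 0.109, Aa 0.442, aa 0.449
```

The three expected proportions always sum to 1, which is a quick sanity check on any implementation.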

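With Mac and Don composing our population, the overall allele frequency splits into a weighted sum of the two subpopulations' allele frequencies, \(p = s p_1 + (1-s) p_2\). Within each subpopulation the same counting estimator applies: for Mac, \(\hat{p}_1 = \frac{2n_{AA,1} + n_{Aa,1}}{2n_1}\), and for Don, \(\hat{p}_2 = \frac{2n_{AA,2} + n_{Aa,2}}{2n_2}\), where the genotype counts and sample sizes \(n_1\) and \(n_2\) refer to the individuals sampled from that subpopulation. Mac therefore contributes \(s\hat{p}_1\) and Don contributes \((1-s)\hat{p}_2\) to the overall frequency.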
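
One way to simulate this sampling scheme is sketched below. It is an assumption-laden illustration, not the original analysis code: the function name `sample_population`, the seed, and the choice to assign each individual to Mac with probability s are all hypothetical.

```python
import random


def sample_population(n, s, p1, p2, rng):
    """Draw n individuals from a two-subpopulation mixture.

    Each individual comes from Mac with probability s (allele frequency p1)
    and from Don otherwise (allele frequency p2). Within a subpopulation the
    two alleles are drawn independently, which is what HWE implies.
    """
    counts = {"AA": 0, "Aa": 0, "aa": 0}
    for _ in range(n):
        p = p1 if rng.random() < s else p2
        n_A = (rng.random() < p) + (rng.random() < p)  # copies of allele A
        counts[("aa", "Aa", "AA")[n_A]] += 1
    return counts


rng = random.Random(1)  # fixed seed so the sketch is reproducible
counts = sample_population(100, s=0.5, p1=0.3, p2=0.8, rng=rng)
p_hat = (2 * counts["AA"] + counts["Aa"]) / (2 * 100)
p_mix = 0.5 * 0.3 + (1 - 0.5) * 0.8  # weighted allele frequency s*p1 + (1-s)*p2
print(counts, p_hat, p_mix)
```

With only 100 individuals, `p_hat` will scatter around `p_mix`; the two agree closely only for large samples.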

Observed and Expected Genotype Frequencies
s = 0.5, p1 = 0.3, p2 = 0.3

Genotype   Observed Frequency   Expected HWE Frequency
AA         0.120                0.109
Aa         0.420                0.442
aa         0.460                0.449

Overall allele frequency (p̂): 0.33

Observed and Expected Genotype Frequencies
s = 0.5, p1 = 0.3, p2 = 0.8

Genotype   Observed Frequency   Expected HWE Frequency
AA         0.320                0.265
Aa         0.390                0.500
aa         0.290                0.235

Overall allele frequency (p̂): 0.515

Observed and Expected Genotype Frequencies
s = 0.5, p1 = 0.5, p2 = 0.3

Genotype   Observed Frequency   Expected HWE Frequency
AA         0.190                0.144
Aa         0.380                0.471
aa         0.430                0.384

Overall allele frequency (p̂): 0.38

Observed and Expected Genotype Frequencies
s = 0.5, p1 = 0.5, p2 = 0.8

Genotype   Observed Frequency   Expected HWE Frequency
AA         0.470                0.442
Aa         0.390                0.446
aa         0.140                0.112

Overall allele frequency (p̂): 0.665

Observed and Expected Genotype Frequencies
s = 0.7, p1 = 0.3, p2 = 0.3

Genotype   Observed Frequency   Expected HWE Frequency
AA         0.110                0.081
Aa         0.350                0.408
aa         0.540                0.511

Overall allele frequency (p̂): 0.285

Observed and Expected Genotype Frequencies
s = 0.7, p1 = 0.3, p2 = 0.8

Genotype   Observed Frequency   Expected HWE Frequency
AA         0.240                0.226
Aa         0.470                0.499
aa         0.290                0.276

Overall allele frequency (p̂): 0.475

Observed and Expected Genotype Frequencies
s = 0.7, p1 = 0.5, p2 = 0.3

Genotype   Observed Frequency   Expected HWE Frequency
AA         0.140                0.130
Aa         0.440                0.461
aa         0.420                0.410

Overall allele frequency (p̂): 0.36

Observed and Expected Genotype Frequencies
s = 0.7, p1 = 0.5, p2 = 0.8

Genotype   Observed Frequency   Expected HWE Frequency
AA         0.350                0.348
Aa         0.480                0.484
aa         0.170                0.168

Overall allele frequency (p̂): 0.59

Generating Siblings and Calculating \(\lambda_s\)

In simple cases, the recurrence risk ratio depends on the degree of relatedness of the two relatives, the underlying genetic model, and the disease allele frequency (p). For the model considered here, the expected sibling recurrence risk ratio is \[E[\lambda_s]=\frac{1}{4} + \frac{1}{2p} + \frac{1}{4p^2}\]

\(E[\lambda_s | p=0.01] = \frac{1}{4} + \frac{1}{2(0.01)} + \frac{1}{4(0.01)^2}=2550.25\)
\(E[\lambda_s | p=0.1] = \frac{1}{4} + \frac{1}{2(0.1)} + \frac{1}{4(0.1)^2}=30.25\)
\(E[\lambda_s | p=0.25] = \frac{1}{4} + \frac{1}{2(0.25)} + \frac{1}{4(0.25)^2}=6.25\)
\(E[\lambda_s | p=0.5] = \frac{1}{4} + \frac{1}{2(0.5)} + \frac{1}{4(0.5)^2}= 2.25\)
\(E[\lambda_s | p=0.75] = \frac{1}{4} + \frac{1}{2(0.75)} + \frac{1}{4(0.75)^2}= 1.361\)
\(E[\lambda_s | p=0.9] = \frac{1}{4} + \frac{1}{2(0.9)} + \frac{1}{4(0.9)^2}= 1.114\)
\(E[\lambda_s | p=0.99] = \frac{1}{4} + \frac{1}{2(0.99)} + \frac{1}{4(0.99)^2}= 1.010\)
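
The values above can be reproduced with a short Python sketch. The function name `expected_lambda_s` is hypothetical. The second form used as a check, \((1+p)^2/(4p^2)\), is simply the same expression placed over a common denominator.

```python
def expected_lambda_s(p):
    """Expected sibling recurrence risk ratio for disease allele frequency p."""
    return 0.25 + 1 / (2 * p) + 1 / (4 * p ** 2)


for p in (0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99):
    value = expected_lambda_s(p)
    # Same quantity written over a common denominator.
    closed_form = (1 + p) ** 2 / (4 * p ** 2)
    assert abs(value - closed_form) < 1e-9 * closed_form
    print(p, round(value, 3))
```

Written as \((1+p)^2/(4p^2)\), it is easy to see that the ratio is always at least 1 for p in (0,1] and equals 1 only at p = 1.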

As the disease allele frequency p approaches 1, \(\lambda_s\) approaches 1. Meanwhile, as p approaches 0, \(\lambda_s\) approaches infinity. In other words, the recurrence risk ratio converges to 1 as the disease allele frequency increases and overtakes the non-disease allele. This makes sense: when the disease allele frequency is high, the prevalence of disease among family members becomes equivalent to the overall population disease prevalence. It also means that the recurrence risk ratio more powerfully confirms the existence of DSLs for monogenic diseases with very small minor allele frequencies and small disease prevalence.

We can compare these expected values to the observed \(\lambda_s\), computed using \(\lambda_s = P(Y_1 = 1,Y_2 = 1)/K^2\), where \(Y_1\) and \(Y_2\) indicate whether each sibling is affected and \(K = P(Y_1 = 1)\) is the disease prevalence in the population. We can also estimate the recurrence risk ratio with \(\hat{\lambda}_s= \frac{s_{case}}{\hat{K}}\).
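
A minimal simulation of the first estimator is sketched below. It rests on assumptions not stated in the text: parents are drawn from a population in HWE, and the disease is fully penetrant and recessive in allele A. That model is chosen because it reproduces the expected-value formula above. The function name, seed, and number of families are hypothetical.

```python
import random


def simulate_lambda_s(p, n_families, rng):
    """Estimate lambda_s = P(Y1=1, Y2=1) / K^2 from simulated sibling pairs.

    Assumed model: HWE parents, allele A has frequency p, and only AA
    individuals are affected (fully penetrant recessive disease).
    """
    def parent():
        # A parent's genotype as a pair of alleles; True means allele A.
        return (rng.random() < p, rng.random() < p)

    def child_affected(mother, father):
        # Each parent transmits one of their two alleles at random;
        # the child is affected only if both transmitted alleles are A.
        return rng.choice(mother) and rng.choice(father)

    both = 0      # families in which both siblings are affected
    affected = 0  # total number of affected siblings
    for _ in range(n_families):
        mother, father = parent(), parent()
        y1 = child_affected(mother, father)
        y2 = child_affected(mother, father)
        both += y1 and y2
        affected += y1 + y2
    k_hat = affected / (2 * n_families)  # prevalence estimate (needs affected > 0)
    return (both / n_families) / k_hat ** 2


rng = random.Random(0)  # fixed seed so the sketch is reproducible
print(simulate_lambda_s(0.5, 100_000, rng))  # should land near the expected 2.25
```

For small p the disease is rare under this model (K = p^2), so many more families are needed before the estimate stabilizes. With too few, the estimate can be very noisy or undefined.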

##     p observed_lambda_s expected_lambda_s
## 1 0.1        15.0000000         30.250000
## 2 0.2         7.8125000          9.000000
## 3 0.3         7.0493827          4.694444
## 4 0.4         4.5742187          3.062500
## 5 0.5         3.2336000          2.250000
## 6 0.6         2.3248457          1.777778
## 7 0.7         1.5743440          1.474490
## 8 0.8         1.0788574          1.265625
## 9 0.9         0.7297668          1.114198

We can see that the expected and observed \(\lambda_s\) differ, in some rows considerably (15.0 versus 30.25 at p = 0.1), but the trend is the same: as p increases from 0 to 1, \(\lambda_s\) decreases monotonically. I believe I have an issue with my computation of the observed recurrence risk ratio, since ideally it should not dip below 1, as it does at p = 0.9 (0.73). A value below 1 would imply that the population disease prevalence is greater than the probability that a full sibling has the disease given that their sibling has it, which is not logical.