Introduction

Turkheimer claims:

For all the hereditarians’ idle intuitions about differences being part genetic and part environmental, where is the empirical or quantitative theory that describes how this apportioning is supposed to work? There is no such thing as a “group heritability coefficient,” no way to put any meat on the speculative bones about partial genetic determination.

There is in fact such a model and it’s quite simple to understand. This simulation series shows how to model it from first principles.

Start

Load packages.

options(digits = 2)
library(pacman)
p_load(kirkegaard, readr, dplyr, gganimate, ICC, gridExtra)

Simulation 1 - Simple polygenic architecture

We start simple. Let’s say there are 10 genetic variants (locusses) that control the trait variation. We use these assumptions:

Each locus has 2 alleles, a positive and a negative.
Each locus/allele has the same effect size, only differing in direction.
All alleles have the same frequency of .50.
There are two arbitrary groups of equal size.

set.seed(1)
#sample sizes
n_total = 100000

#simulate the score counts
d1 = data_frame(polygenic_score = rbinom(n_total, 10, .5),
                  group = sample(c("A", "B"), replace = T, size = n_total))

#plot
ggplot(d1, aes(polygenic_score, fill = group)) +
  geom_histogram(position = "dodge") +
  scale_x_continuous("Polygenic score", breaks = 0:10) +
  theme_bw()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

GG_save("figs/sim1.png")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#describe
d1 %$% describeBy(polygenic_score, group)

## 
##  Descriptive statistics by group 
## group: A
##    vars     n mean  sd median trimmed mad min max range  skew kurtosis
## X1    1 49830    5 1.6      5       5 1.5   0  10    10 -0.01    -0.21
##      se
## X1 0.01
## -------------------------------------------------------- 
## group: B
##    vars     n mean  sd median trimmed mad min max range skew kurtosis   se
## X1    1 50170    5 1.6      5       5 1.5   0  10    10 0.01    -0.21 0.01

Simple, and yet we still get something that is beginning to look like a normal distribution. The groups of course have nearly identical distributions, since group membership was random.

Simulation 2 - Simple polygenic architecture with noise

We continue as before, except now we add a little environmental noise. We use these assumptions:

Each locus has 2 alleles, a positive and a negative.
Each locus/allele has the same effect size, only differing in direction.
There’s some environmental noise.
All alleles have the same frequency of .50.
There are two arbitrary groups.

For visualization purposes, it is easier to switch to empirical density fits now. We plot both for comparison.

set.seed(1)
#simulate the summed scores
d2 = data_frame(polygenic_score = rbinom(n_total, 10, .5),
                  unshared_environment = rnorm(n_total),
                  phenotypic_score = polygenic_score + unshared_environment,
                  group = sample(c("A", "B"), replace = T, size = n_total))

#plot
ggplot(d2, aes(phenotypic_score, color = group)) +
  geom_histogram(position = "dodge", alpha = .1) +
  theme_bw() +
  scale_x_continuous("Phenotypic score")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

GG_save("figs/sim2_hist.png")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(d2, aes(phenotypic_score, color = group)) +
  geom_density() +
  theme_bw() +
  scale_x_continuous("Phenotypic score")

GG_save("figs/sim2_density.png")

#describe
d2 %$% describeBy(polygenic_score, group)

## 
##  Descriptive statistics by group 
## group: A
##    vars     n mean  sd median trimmed mad min max range skew kurtosis   se
## X1    1 49880    5 1.6      5       5 1.5   0  10    10    0     -0.2 0.01
## -------------------------------------------------------- 
## group: B
##    vars     n mean  sd median trimmed mad min max range skew kurtosis   se
## X1    1 50120    5 1.6      5       5 1.5   0  10    10    0    -0.22 0.01

d2 %$% describeBy(unshared_environment, group)

## 
##  Descriptive statistics by group 
## group: A
##    vars     n mean sd median trimmed mad  min max range  skew kurtosis se
## X1    1 49880    0  1      0       0   1 -4.3 3.9   8.2 -0.01    -0.02  0
## -------------------------------------------------------- 
## group: B
##    vars     n mean sd median trimmed mad  min max range skew kurtosis se
## X1    1 50120    0  1   0.01       0   1 -4.5 4.3   8.9    0     0.01  0

d2 %$% describeBy(phenotypic_score, group)

## 
##  Descriptive statistics by group 
## group: A
##    vars     n mean  sd median trimmed mad  min max range  skew kurtosis
## X1    1 49880    5 1.9      5       5 1.9 -2.9  12    15 -0.01    -0.08
##      se
## X1 0.01
## -------------------------------------------------------- 
## group: B
##    vars     n mean  sd median trimmed mad  min max range skew kurtosis
## X1    1 50120    5 1.9      5       5 1.9 -2.5  12    15 0.01    -0.13
##      se
## X1 0.01

Sill quite simple 5 genetic locuses and a little noise, and yet now we see the familiar normal distributions.

In terms of variance, since our causes are uncorrelated, the math is simple: we can just sum the variance of the polygenic scores and the environmental effects, and then take the genetic fraction of this. These numbers are given above and are approximately 1.55 and 1, meaning that heritability is 0.71. We can verify that with regression and scatterplot.

set.seed(1)
#model
lm(phenotypic_score ~ polygenic_score, data = d2) %>% summary

## 
## Call:
## lm(formula = phenotypic_score ~ polygenic_score, data = d2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.544 -0.677  0.002  0.674  4.312 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.00597    0.01048    0.57     0.57    
## polygenic_score  0.99903    0.00200  499.79   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1 on 99998 degrees of freedom
## Multiple R-squared:  0.714,  Adjusted R-squared:  0.714 
## F-statistic: 2.5e+05 on 1 and 99998 DF,  p-value: <2e-16

#plot
GG_scatter(d2, "polygenic_score", "phenotypic_score") +
  scale_y_continuous("Phenotypic score") +
  scale_x_continuous("Polygenic score")

GG_save("figs/sim2_scatter.png")

ggplot(d2 %>% sample_frac, aes(polygenic_score, phenotypic_score, color = group)) +
  geom_point(alpha = .1) +
  geom_smooth(method = lm, se = F, alpha = .3) +
  theme_bw() +
  scale_y_continuous("Phenotypic score") +
  scale_x_continuous("Polygenic score")

GG_save("figs/sim2_scatter_group.png")

The regression model confirms it: R2 is about .70. The correlation is the square root of this value, about .83. When we plot the regressions by group, there are no differences in slope or intercept.

Simulation 3 - A genetic group difference

We take it one step up from before. Instead of simulating groups with the same frequencies of the variants, we vary the frequencies by group slightly.

set.seed(1)
#simulate the summed scores
d3 = data_frame(polygenic_score = rbinom(n_total/2, 10, .55),
                  group = rep("A", n_total/2)) %>% 
  rbind(
    data_frame(
      polygenic_score = rbinom(n_total/2, 10, .45),
                  group = rep("B", n_total/2)
    )
  ) %>% mutate(
      unshared_environment = rnorm(n_total),
      phenotypic_score = polygenic_score + unshared_environment
  )

#plot
ggplot(d3, aes(phenotypic_score, color = group)) +
  geom_density() +
  scale_x_continuous("Phenotypic score") +
  theme_bw()

GG_save("figs/sim3_dist_group.png")

#describe
d3 %$% describeBy(polygenic_score, group)

## 
##  Descriptive statistics by group 
## group: A
##    vars     n mean  sd median trimmed mad min max range  skew kurtosis
## X1    1 50000  5.5 1.6      6     5.5 1.5   0  10    10 -0.07    -0.21
##      se
## X1 0.01
## -------------------------------------------------------- 
## group: B
##    vars     n mean  sd median trimmed mad min max range skew kurtosis   se
## X1    1 50000  4.5 1.6      4     4.5 1.5   0  10    10 0.06    -0.22 0.01

d3 %$% describeBy(unshared_environment, group)

## 
##  Descriptive statistics by group 
## group: A
##    vars     n mean sd median trimmed mad  min max range  skew kurtosis se
## X1    1 50000    0  1      0       0   1 -4.5 4.3   8.9 -0.02        0  0
## -------------------------------------------------------- 
## group: B
##    vars     n mean sd median trimmed mad  min max range skew kurtosis se
## X1    1 50000    0  1      0       0   1 -4.1 4.2   8.3    0    -0.01  0

d3 %$% describeBy(phenotypic_score, group)

## 
##  Descriptive statistics by group 
## group: A
##    vars     n mean  sd median trimmed mad  min max range  skew kurtosis
## X1    1 50000  5.5 1.9    5.5     5.5 1.9 -2.3  12    15 -0.04    -0.09
##      se
## X1 0.01
## -------------------------------------------------------- 
## group: B
##    vars     n mean  sd median trimmed mad  min max range skew kurtosis
## X1    1 50000  4.5 1.9    4.5     4.5 1.9 -2.9  12    15 0.05     -0.1
##      se
## X1 0.01

#gap
d3 %$% SMD_matrix(polygenic_score, group)

##      A    B
## A   NA 0.64
## B 0.64   NA

Now it is starting to look more familiar. The group gap is about half a standard deviation. The cause of the group gap is entirely genetic, which can be seen from regression and scatterplot:

#model
lm(phenotypic_score ~ group, data = d3) %>% summary()

## 
## Call:
## lm(formula = phenotypic_score ~ group, data = d3)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.841 -1.269  0.001  1.269  7.351 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.49691    0.00834   659.4   <2e-16 ***
## groupB      -0.99936    0.01179   -84.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.9 on 99998 degrees of freedom
## Multiple R-squared:  0.067,  Adjusted R-squared:  0.067 
## F-statistic: 7.19e+03 on 1 and 99998 DF,  p-value: <2e-16

lm(phenotypic_score ~ polygenic_score + group, data = d3) %>% summary()

## 
## Call:
## lm(formula = phenotypic_score ~ polygenic_score + group, data = d3)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.538 -0.677  0.003  0.673  4.322 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.02297    0.01191    1.93    0.054 .  
## polygenic_score  0.99545    0.00201  495.89   <2e-16 ***
## groupB           0.00180    0.00665    0.27    0.786    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1 on 99997 degrees of freedom
## Multiple R-squared:  0.73,   Adjusted R-squared:  0.73 
## F-statistic: 1.35e+05 on 2 and 99997 DF,  p-value: <2e-16

#plot
ggplot(d3 %>% sample_frac, aes(polygenic_score, phenotypic_score, color = group)) +
  geom_point(alpha = .1) +
  geom_smooth(method = lm, se = F) +
  xlab("Polygenic score") + ylab("Phenotypic score") +
  theme_bw()

GG_save("figs/sim3_scatter_group.png")

#cor / A
d3 %$% cor(polygenic_score, phenotypic_score)

## [1] 0.85

d3 %$% cor(polygenic_score, phenotypic_score) %>% `^`(2)

## [1] 0.73

#E
d3 %$% cor(unshared_environment, phenotypic_score)

## [1] 0.51

d3 %$% cor(unshared_environment, phenotypic_score) %>% `^`(2)

## [1] 0.26

The model shows that group membership itself has no validity when we take into account the genetic differences, which is to say that the group differnece is 100% genetic (between group heritability is 100%). At the same time, the scatterplot shows that the genetic architecture works the same way: the slope and intercept are identical.

Simulation 4 - Mixed cause group difference

But what if genetics is not the only cause? In the real world, we cannot expect genetics to solely explain all group differences. How can we simulate a situation like this? We can add a shared environment effect, but let it differ in means by group.

#simulate the summed scores
d4 = data_frame(polygenic_score = rbinom(n_total/2, 10, .55),
                  shared_environment = rnorm(n_total/2, 1, 1),
                  group = rep("A", n_total/2)) %>% 
  rbind(
    data_frame(
      polygenic_score = rbinom(n_total/2, 10, .45),
      shared_environment = rnorm(n_total/2, 0, 1),
      group = rep("B", n_total/2)
    )
  ) %>% mutate(
      unshared_environment = rnorm(n_total),
      phenotypic_score = polygenic_score + unshared_environment + shared_environment
  )

#plot
ggplot(d4, aes(phenotypic_score, color = group)) +
  geom_density() +
  xlab("Phenotypic score") +
  theme_bw()

GG_save("figs/sim4_dist_group.png")

#describe
d4 %$% describeBy(polygenic_score, group)

## 
##  Descriptive statistics by group 
## group: A
##    vars     n mean  sd median trimmed mad min max range  skew kurtosis
## X1    1 50000  5.5 1.6      6     5.5 1.5   0  10    10 -0.09    -0.16
##      se
## X1 0.01
## -------------------------------------------------------- 
## group: B
##    vars     n mean  sd median trimmed mad min max range skew kurtosis   se
## X1    1 50000  4.5 1.6      4     4.5 1.5   0  10    10 0.06    -0.17 0.01

d4 %$% describeBy(unshared_environment, group)

## 
##  Descriptive statistics by group 
## group: A
##    vars     n mean   sd median trimmed  mad  min max range skew kurtosis
## X1    1 50000    0 0.99      0       0 0.99 -3.8 4.1     8    0    -0.03
##    se
## X1  0
## -------------------------------------------------------- 
## group: B
##    vars     n mean sd median trimmed mad  min max range skew kurtosis se
## X1    1 50000    0  1   0.01       0   1 -4.7 4.2   8.9    0    -0.04  0

d4 %$% describeBy(phenotypic_score, group)

## 
##  Descriptive statistics by group 
## group: A
##    vars     n mean  sd median trimmed mad  min max range  skew kurtosis
## X1    1 50000  6.5 2.1    6.5     6.5 2.1 -2.1  16    18 -0.03    -0.02
##      se
## X1 0.01
## -------------------------------------------------------- 
## group: B
##    vars     n mean  sd median trimmed mad  min max range skew kurtosis
## X1    1 50000  4.5 2.1    4.5     4.5 2.1 -3.9  13    17 0.03    -0.08
##      se
## X1 0.01

#gap
d4 %$% SMD_matrix(polygenic_score, group) #we use the pooled within group SD, as normally done

##      A    B
## A   NA 0.64
## B 0.64   NA

Now the group gap has become somewhat larger, about .70 d. Now, however, the group difference is not just genetic, which we can confirm in analyses:

#cors
wtd.cors(d4[-3]) %>% write_clipboard()

##                      Polygenic score Shared environment
## Polygenic score                 1.00               0.14
## Shared environment              0.14               1.00
## Unshared environment            0.00               0.00
## Phenotypic score                0.77               0.58
##                      Unshared environment Phenotypic score
## Polygenic score                      0.00             0.77
## Shared environment                   0.00             0.58
## Unshared environment                 1.00             0.42
## Phenotypic score                     0.42             1.00

#modeling
lm(phenotypic_score ~ group, data = d4) %>% summary()

## 
## Call:
## lm(formula = phenotypic_score ~ group, data = d4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.570 -1.427  0.002  1.433  9.123 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.50856    0.00943     690   <2e-16 ***
## groupB      -2.01297    0.01334    -151   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.1 on 99998 degrees of freedom
## Multiple R-squared:  0.185,  Adjusted R-squared:  0.185 
## F-statistic: 2.28e+04 on 1 and 99998 DF,  p-value: <2e-16

lm(phenotypic_score ~ polygenic_score + group, data = d4) %>% summary()

## 
## Call:
## lm(formula = phenotypic_score ~ polygenic_score + group, data = d4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.204 -0.944 -0.004  0.950  6.647 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.00192    0.01692    59.2   <2e-16 ***
## polygenic_score  1.00030    0.00285   350.8   <2e-16 ***
## groupB          -1.00269    0.00939  -106.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.4 on 99997 degrees of freedom
## Multiple R-squared:  0.635,  Adjusted R-squared:  0.635 
## F-statistic: 8.69e+04 on 2 and 99997 DF,  p-value: <2e-16

lm(phenotypic_score ~ shared_environment + group, data = d4) %>% summary()

## 
## Call:
## lm(formula = phenotypic_score ~ shared_environment + group, data = d4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -7.63  -1.26   0.00   1.27   7.70 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.50211    0.01016   541.8   <2e-16 ***
## shared_environment  1.00450    0.00585   171.6   <2e-16 ***
## groupB             -1.00244    0.01312   -76.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.9 on 99997 degrees of freedom
## Multiple R-squared:  0.371,  Adjusted R-squared:  0.371 
## F-statistic: 2.95e+04 on 2 and 99997 DF,  p-value: <2e-16

lm(phenotypic_score ~ polygenic_score + shared_environment + group, data = d4) %>% summary()

## 
## Call:
## lm(formula = phenotypic_score ~ polygenic_score + shared_environment + 
##     group, data = d4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.738 -0.674  0.002  0.675  4.183 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         0.01595    0.01235    1.29     0.20    
## polygenic_score     0.99772    0.00201  495.28   <2e-16 ***
## shared_environment  0.99820    0.00315  316.90   <2e-16 ***
## groupB             -0.00109    0.00735   -0.15     0.88    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1 on 99996 degrees of freedom
## Multiple R-squared:  0.818,  Adjusted R-squared:  0.818 
## F-statistic: 1.5e+05 on 3 and 99996 DF,  p-value: <2e-16

#residuals
d4$resid_G = lm(phenotypic_score ~ polygenic_score, data = d4) %>% resid()
d4$resid_C = lm(phenotypic_score ~ shared_environment, data = d4) %>% resid()
d4$resid_GC = lm(phenotypic_score ~ polygenic_score + shared_environment, data = d4) %>% resid()

#residual gaps
d4 %$% SMD_matrix(phenotypic_score, group)

##      A    B
## A   NA 0.95
## B 0.95   NA

d4 %$% SMD_matrix(resid_G, group)

##      A    B
## A   NA 0.64
## B 0.64   NA

d4 %$% SMD_matrix(resid_C, group)

##      A    B
## A   NA 0.43
## B 0.43   NA

d4 %$% SMD_matrix(resid_GC, group)

##         A       B
## A      NA 0.00081
## B 0.00081      NA

#plot
sim4_scatter1 = ggplot(d4 %>% sample_frac, aes(polygenic_score, phenotypic_score, color = group)) +
  geom_point(alpha = .1) +
  geom_smooth(method = lm, se = F) +
  xlab("Polygenic score") + ylab("Phenotypic score") +
  theme_bw()

sim4_scatter2 = ggplot(d4 %>% sample_frac, aes(shared_environment, phenotypic_score, color = group)) +
  geom_point(alpha = .1) +
  geom_smooth(method = lm, se = F) +
  xlab("Shared environment") + ylab("Phenotypic score") +
  theme_bw()

#plot together
GG_save("figs/sim4_scatters.png",
        plot = grid.arrange(grobs = list(sim4_scatter1, sim4_scatter2), ncol = 1))

There are multiple things of interest. First, note that genetics and environment have become correlated at about .10. This happened due to the common cause: group membership. The A group has both better environment and better genetics. Second, when we analyze the data with just genetics as a predictor, we see that group membership has residual validity, meaning that we are missing some other factor (or not measuring genetics perfectly). When we add the missing variable, shared environment, group membership goes to 0. Third, the scatterplot confirms the missing variable, which shows up as an intercept difference (cf. intercept bias and omitted variable bias). Of course, the same is true from the perspective of shared environment.

There are many ways one can calculate the between group heritability (Defries, 1972; Jensen 1998, p. 447), but one fairly simple way is using intraclass corerlations:

\[ h^2_B \approx h^2_W \frac{(1-t)r}{(1-r)t} \] where \(h^2_B\) is the between group heritability, \(h^2_W\) is the within group heritability, t is the intraclass phenotypic correlation, and r is the intraclass genotypic correlation. This is approximation that holds when the number of cases is large. In our case, we can calculate the required values to derive the between group heritability.

#within group heritabilities
#derived from correlation between genotypic scores and phenotypic scores
d4 %>% 
  plyr::ddply("group", function(x) cor(x$polygenic_score, x$phenotypic_score)^2) %>% 
  print() %$% 
  mean(V1) ->
  sim4_h2w

##   group   V1
## 1     A 0.55
## 2     B 0.55

#intraclass correlations
sim4_t = ICC::ICCbare(group, phenotypic_score, d4) %>% print()

## [1] 0.31

sim4_r = ICC::ICCbare(group, polygenic_score, d4) %>% print()

## [1] 0.17

#between group heritability
sim4_h2w * ((1 - sim4_t) * sim4_r)/((1 - sim4_r) * sim4_t)

## [1] 0.25

So, we find that the between group heritability is .25, or in plain terms, genetic differences explain 25% of the group difference. To ensure we did things correctly, we apply the same method to the results from Simulation 3, where we know the entire group difference is explained by genetics.

#within group heritabilities
#derived from correlation between genotypic scores and phenotypic scores
d3 %>% 
  plyr::ddply("group", function(x) cor(x$polygenic_score, x$phenotypic_score)^2) %>% 
  print %$% 
  mean(V1) ->
  sim3_h2w

##   group   V1
## 1     A 0.71
## 2     B 0.71

#intraclass correlations
sim3_t = ICC::ICCbare(group, phenotypic_score, d3) %>% print

## [1] 0.13

sim3_r = ICC::ICCbare(group, polygenic_score, d3) %>% print

## [1] 0.17

#between group heritability
sim3_h2w * ((1 - sim3_t) * sim3_r)/((1 - sim3_r) * sim3_t)

## [1] 1

And we see that the value is very close to 1.00, as it should be.

Simulation 5 - Ancestry and trait levels (admixture analysis)

In reality, we never have perfect measurement of the genetic and environment causes, so we must either deal with measurement error, or use a more crude method. The simplest of the crude methods is the admixture analysis. This can be used in populations that are mixed between two or more ancestral groups, and where we are able to estimate which fraction of a given person’s genome is from each ancestral group. This method is routinely used in medical genetics (e.g. Cheng et al, 2012).

Simulating this scenario is more complex than previous. We increase the number of genetic locuses to 100, and add another 1000 locuses used to estimate an individual’s genomic ancestry. By using different locuses for the trait of interest and ancestry, we assume that there’s no pleiotropy. Otherwise, we use the same assumptions as before.

To get mixed persons, we have to mate. We do this using the simplest model possible, random mating, which is to say that person’s are no more likely to mate with someone closer to their own trait level, or closer to their own ancestry. We do a few generations of random mating to ensure a fairly mixed population, even if somewhat unrealistic population. Oh and everybody is a hermaphrodite for simplicity. Finally, we change the ratio of the population sizes to 4:1.

set.seed(1)
#sample sizes
n_large = 5000
n_small = 5000
n_polygenic_variants = 100
n_ancestry_variants = 1000

#simulating the genetic data is now more complicated as we must simulate each locus
d5_t = rbind(
  rbinom(n_large * n_polygenic_variants, 1, .55) %>% matrix(ncol = n_polygenic_variants),
  rbinom(n_small * n_polygenic_variants, 1, .45) %>% matrix(ncol = n_polygenic_variants)
  )

#ancestry locuses
d5_a = rbind(
  rep(1, n_large * n_ancestry_variants) %>% matrix(ncol = n_ancestry_variants),
  rep(0, n_small * n_ancestry_variants) %>% matrix(ncol = n_ancestry_variants)
)

#combined
d5_g = cbind(
  d5_t,
  d5_a
)

#mating function
mate_them = function(genetic_mat, generations = 5, offspring = 2) {
  #initial 0 generations
  if (generations == 0) return(genetic_mat)
  
  #find pairs
  mating_pairs = matrix(sample(1:nrow(genetic_mat)), ncol = 2)
  
  #loop over pairs
  genetic_mat_new = plyr::adply(mating_pairs, .margins = 1, .expand = F, .fun = function(this_pair) {
    #loop over offspring
    their_offspring = plyr::laply(1:offspring, function(this_offspring) {
      #which cols (variants) from parent 1
      parent_1_cols = rbernoulli(ncol(genetic_mat)) %>% as.logical
      
      #make offspring base from parent 2
      this_offspring = genetic_mat[this_pair[2], ]
      
      #insert parent 1's
      this_offspring[parent_1_cols] = genetic_mat[this_pair[1], parent_1_cols]
      
      this_offspring
    })
    
    their_offspring
  }) %>% 
    .[-1] %>% 
    as.matrix
  #remove unneeded col, convert to matrix again
  #this could be done more efficiently
  
  #recursive function
  mate_them(genetic_mat_new, generations = generations - 1, offspring = offspring)
}

#mate 5 times
d5_list = list(
  g0 = d5_g
)
d5_list$g1 = mate_them(d5_list$g0, generations = 1)
d5_list$g2 = mate_them(d5_list$g1, generations = 1)
d5_list$g3 = mate_them(d5_list$g2, generations = 1)
d5_list$g4 = mate_them(d5_list$g3, generations = 1)
d5_list$g5 = mate_them(d5_list$g4, generations = 1)

#compute summary scores, function
compute_scores = function(x) {
  data_frame(
    polygenic_score = x[, 1:n_polygenic_variants] %>% rowSums,
    ancestry = x[, (n_polygenic_variants+1):(n_polygenic_variants + n_ancestry_variants)] %>% rowMeans, #as a fraction of group 1
    unshared_environment = rnorm(nrow(x), 0, 3),
    phenotypic_score = polygenic_score + unshared_environment
  )
}

#compute for each generation
d5_list_summary = map(d5_list, compute_scores)

#easier name
d5 = d5_list_summary$g5

#scatterplot by generation
d5_long = ldf_to_df(d5_list_summary, by_name = "frame")
scatter_over_g = ggplot(d5_long, aes(ancestry, phenotypic_score, frame = frame)) +
  geom_point() +
  geom_smooth(method = lm, se = T, fullrange = T, color = "orange") +
  scale_x_continuous("Ancestry", limits = c(0, 1), labels = scales::percent) +
  theme_bw()

#animated
gganimate::gganimate(scatter_over_g, filename = "sim5_scatter_by_gen.gif")

## Executing: 
## convert -loop 0 -delay 100 Rplot1.png Rplot2.png Rplot3.png
##     Rplot4.png Rplot5.png Rplot6.png 'sim5_scatter_by_gen.gif'

## Output at: sim5_scatter_by_gen.gif

#in a single picture
#plot each in a list first
sim5_scatters = map(1:6, function(i) {
  #get generation name
  this_gen = names(d5_list)[i]
  
  #this g
  this_d = d5_long %>% 
    filter(frame == this_gen)
  
  #correlation
  this_cor = wtd.cors(this_d[c("ancestry", "phenotypic_score")])[2, 1]
  
  #plot
  this_d %>% 
    ggplot(aes(ancestry, phenotypic_score, frame = frame)) +
    geom_point(size = .01) +
    geom_smooth(method = lm, se = T, fullrange = T, color = "orange") +
    scale_x_continuous("Ancestry", limits = c(0, 1), labels = scales::percent) +
    theme_bw() +
    ggtitle(sprintf("%s: r = %.2f", this_gen, this_cor)) +
    scale_y_continuous(breaks = seq(0, 100, by = 10))
})
GG_save("figs/sim5_scatter_by_gen.png",
        plot = grid.arrange(grobs = sim5_scatters, ncol = 3))

We are now ready to examine the relationship between ancestry and the trait score across generations of intermating.

We do see a slight relationship between phenotypic score and ancestry, as expected. The correlation effect size itself is not so information because it depends on the variation in ancestry, and in this figure we can see that there is actually fairly little such variation after just 5 generations of random mating. Still, if we follow the regression line to x = 0, we get the predicted value of about 45, and if we follow it to x = 1, we get the predicted value of 55. This gap between the extrema of x represents the size of the genetic difference. We know this of course because we began by simulating data for the groups such that the mean polygenic score by group were 55 and 45. We can verify this by analysis.

#cors
wtd.cors(d5)

##                      polygenic_score ancestry unshared_environment
## polygenic_score                1.000    0.162               -0.028
## ancestry                       0.162    1.000               -0.001
## unshared_environment          -0.028   -0.001                1.000
## phenotypic_score               0.855    0.140                0.495
##                      phenotypic_score
## polygenic_score                  0.85
## ancestry                         0.14
## unshared_environment             0.49
## phenotypic_score                 1.00

#model
lm(phenotypic_score ~ ancestry, data = d5) %>% summary

## 
## Call:
## lm(formula = phenotypic_score ~ ancestry, data = d5)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.421  -3.851  -0.006   3.942  23.588 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   45.508      0.323   141.0   <2e-16 ***
## ancestry       8.990      0.635    14.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.8 on 9998 degrees of freedom
## Multiple R-squared:  0.0197, Adjusted R-squared:  0.0196 
## F-statistic:  201 on 1 and 9998 DF,  p-value: <2e-16

lm(phenotypic_score ~ ancestry + polygenic_score, data = d5) %>% summary

## 
## Call:
## lm(formula = phenotypic_score ~ ancestry + polygenic_score, data = d5)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.035  -2.057   0.018   2.035  10.153 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.81386    0.32278    2.52    0.012 *  
## ancestry         0.11934    0.33709    0.35    0.723    
## polygenic_score  0.98294    0.00605  162.55   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3 on 9997 degrees of freedom
## Multiple R-squared:  0.731,  Adjusted R-squared:  0.731 
## F-statistic: 1.36e+04 on 2 and 9997 DF,  p-value: <2e-16

The first model shows that when we input ancestry alone, it is a useful predictor of the phenotypic score, but when it is along with the polygenic score, it useless, because it is in fact just a proxy for the polygenic score.

This final simulation can be made more realistic with the addition of non-random mating, in particular endogamic, but the modeling will not greatly change. The points remains, however, that conceptually understanding genetic group differences in polygenic traits is not particularly difficult.

Simulation 6 - Non-random mating

Finally, we need to add non-random mating and SIRE groups to the simulation. We do this by matching up pairs based on their phenotypic scores. In other words, there is no direct own-ancestry preference except, only same-phenotype preference. This is somewhat unrealistic, but most mating is probably related to trait levels. In real life there would be many (education, intelligence, music taste etc.), but here we use just a single.

set.seed(1)
#we use the same founder populations as simulation 5
#so we dont need to simulate them again

#function to perform non-random mating
#done by sorting the two columns based on a noisy version of one of them
#effectively results in inperfect sorting for the non-noisy version
non_random_pair = function(pair_phenotypes, noise = 1) {
  pair_phenotypes %<>% 
    as_data_frame() %>% 
    set_colnames(c("p1", "p2")) %>% 
    mutate(
      #add initial numbers, so we can keep track of original positions
      p1n = 1:n(),
      p2n = 1:n()
    ) %>% 
    #sort by p2's true score
    arrange(p2)
  
  #subset the p1 part, and sort by noisy phenotype to produce imperfect true sorting
  p1 = pair_phenotypes[c("p1n", "p1")] %>% 
    mutate(
      #standardize and add noise
      noisy_p1 = standardize(p1) + rnorm(nrow(pair_phenotypes), mean = noise)
    ) %>% 
    #sort by noisy
    arrange(noisy_p1)
  
  #now we join the sorted p2 with the sorted p1, and these will be our semi-matched pairs
  chosen_pairs = cbind(
    p1$p1n,
    #add the number of rows to p2 to get back to the complete dataset number
    pair_phenotypes$p2n + nrow(pair_phenotypes)
  )

  chosen_pairs
}

#mating function, vesion 2
mate_them2 = function(genetic_mat, generations = 5, offspring = 2) {
  #initial 0 generations
  if (generations == 0) return(genetic_mat)
  
  #randomize rows
  genetic_mat = genetic_mat[sample(1:nrow(genetic_mat)), ]

  #calculate polygenic scores
  #polygenic_scores = genetic_mat[, 1:n_polygenic_variants] %>% rowmMeans()
  ancestry = genetic_mat[, -(1:n_polygenic_variants)] %>% rowMeans()
  phenotypic_scores = ancestry
  
  #find pairs
  mating_pairs = non_random_pair(data_frame(
    #get first half
    phenotypic_scores[1:(nrow(genetic_mat)/2)],
    #second half
    phenotypic_scores[(nrow(genetic_mat)/2 + 1):nrow(genetic_mat)]
  ) %>% as.matrix())
  
  #loop over pairs
  genetic_mat_new = plyr::adply(mating_pairs, .margins = 1, .expand = F, .fun = function(this_pair) {
    #loop over offspring
    their_offspring = plyr::laply(1:offspring, function(this_offspring) {
      #which cols (variants) from parent 1
      parent_1_cols = rbernoulli(ncol(genetic_mat)) %>% as.logical()
      
      #make offspring base from parent 2
      this_offspring = genetic_mat[this_pair[2], ]
      
      #insert parent 1's
      this_offspring[parent_1_cols] = genetic_mat[this_pair[1], parent_1_cols]
      
      this_offspring
    })
    
    their_offspring
  }) %>% 
    .[-1] %>% 
    as.matrix()
  #remove unneeded col, convert to matrix again
  #this could be done more efficiently
  
  #recursive function
  mate_them2(genetic_mat_new, generations = generations - 1, offspring = offspring)
}

#mate 5 times
d6_list = list(
  #shuffle the rows, so groups arent cleanly split to begin with
  g0 = d5_g
)
d6_list$g1 = mate_them2(d6_list$g0, generations = 1)
d6_list$g2 = mate_them2(d6_list$g1, generations = 1)
d6_list$g3 = mate_them2(d6_list$g2, generations = 1)
d6_list$g4 = mate_them2(d6_list$g3, generations = 1)
d6_list$g5 = mate_them2(d6_list$g4, generations = 1)

#compute for each generation
d6_list_summary = map(d6_list, compute_scores)

#easier name
d6 = d6_list_summary$g5

#scatterplot by generation
d6_long = ldf_to_df(d6_list_summary, by_name = "frame")
scatter_over_g = ggplot(d6_long, aes(ancestry, phenotypic_score, frame = frame)) +
  geom_point() +
  geom_smooth(method = lm, se = T, fullrange = T, color = "orange") +
  scale_x_continuous("Ancestry", limits = c(0, 1), labels = scales::percent) +
  theme_bw()

#animated
gganimate::gganimate(scatter_over_g, filename = "sim6_scatter_by_gen.gif")

## Executing: 
## convert -loop 0 -delay 100 Rplot1.png Rplot2.png Rplot3.png
##     Rplot4.png Rplot5.png Rplot6.png 'sim6_scatter_by_gen.gif'

## Output at: sim6_scatter_by_gen.gif

#in a single picture
#plot each in a list first
sim6_scatters = map(1:6, function(i) {
  #get generation name
  this_gen = names(d6_list)[i]
  
  #this g
  this_d = d6_long %>% 
    filter(frame == this_gen)
  
  #correlation
  this_cor = wtd.cors(this_d[c("ancestry", "phenotypic_score")])[2, 1]
  
  #plot
  this_d %>% 
    ggplot(aes(ancestry, phenotypic_score, frame = frame)) +
    geom_point(size = .01) +
    geom_smooth(method = lm, se = T, fullrange = T, color = "orange") +
    scale_x_continuous("Ancestry", limits = c(0, 1), labels = scales::percent) +
    theme_bw() +
    ggtitle(sprintf("%s: r = %.2f", this_gen, this_cor)) +
    scale_y_continuous(breaks = seq(0, 100, by = 10))
})
GG_save("figs/sim6_scatter_by_gen.png",
        plot = grid.arrange(grobs = sim6_scatters, ncol = 3))

Did we succeed? What does the distribution of ancestry look like?

#standard deviations
d5_long %>% 
  filter(frame == "g5") %>% 
  .[["ancestry"]] %>% 
  sd()

## [1] 0.091

d6_long %>% 
  filter(frame == "g5") %>% 
  .[["ancestry"]] %>% 
  sd()

## [1] 0.34

#sim5
sim6_sim5_ancestry_dist = d5_long %>% 
  filter(frame == "g5") %>% 
  GG_denhist("ancestry", vline = F) +
  scale_x_continuous("Ancestry", limits = c(0, 1), labels = scales::percent)

## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.

#sim6
sim6_sim6_ancestry_dist = d6_long %>% 
  filter(frame == "g5") %>% 
  GG_denhist("ancestry", vline = F) +
  scale_x_continuous("Ancestry", limits = c(0, 1), labels = scales::percent)

## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.

GG_save("figs/sim6_ancestry_compare.png",
        plot = grid.arrange(grobs = list(sim6_sim5_ancestry_dist, sim6_sim6_ancestry_dist), ncol = 2))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#SIRE using ancestry%
d6_long$group = map_lgl(d6_long$ancestry, ~rbernoulli(1, p = .)) %>% plyr::mapvalues(c(T, F), c("A", "B"))

#plot phenotypic scores by SIRE
d6_long %>% 
  filter(frame == "g5") %>% 
  ggplot(aes(phenotypic_score, color = group)) +
  geom_density() +
  xlab("Phenotypic score") +
  theme_bw()

GG_save("figs/sim6_dist_group.png")

#ancestry by group
d6_long %>% 
  filter(frame == "g5") %>% 
  GG_denhist("ancestry", group = "group") +
  scale_x_continuous("Ancestry", limits = c(0, 1), labels = scales::percent) +
  theme_bw()

## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 2 rows containing missing values (geom_bar).

GG_save("figs/sim6_SIRE_ancestry.png")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 2 rows containing missing values (geom_bar).

#regression by group
d6_long %>% 
  filter(frame == "g5") %>% 
  ggplot(aes(ancestry, phenotypic_score, color = group)) +
  geom_point(alpha = .1) +
  geom_smooth(method = lm, se = F) +
  scale_x_continuous("Ancestry", limits = c(0, 1), labels = scales::percent) +
  ylab("Phenotypic score") +
  theme_bw()

GG_save("figs/sim6_scatter_by_SIRE.png")

Since everything seems to be in order and the regression slopes still pointed at the correct ancestry means, let us then see if the between group heritability comes out at near 100% as it should.

#subset we want
d6_g5e = d6_long %>% 
  filter(frame == "g5")

#within group heritabilities
#derived from correlation between genotypic scores and phenotypic scores
d6_g5e %>% 
  plyr::ddply("group", function(x) cor(x$polygenic_score, x$phenotypic_score)^2) %>% 
  print() %$% 
  mean(V1) ->
  sim6_h2w

##   group   V1
## 1     A 0.78
## 2     B 0.76

#intraclass correlations
sim6_t = ICC::ICCbare(group, phenotypic_score, d6_g5e) %>% print()

## [1] 0.2

sim6_r = ICC::ICCbare(group, polygenic_score, d6_g5e) %>% print()

## [1] 0.24

#between group heritability
sim6_h2w * ((1 - sim6_t) * sim6_r)/((1 - sim6_r) * sim6_t)

## [1] 1

Not possible to determine?

Turkheimer also claims:

The hereditarians are always saying that any day now there will be new genomic data that will settle the question once and for all. They are wrong. More on that in subsequent posts.

We do claim that, and we are right. Genomic results do show the expected patterns (Kirkegaard et al. 2016, in review). Some population geneticists have already concluded that inter-European population differences in height and BMI are partly genetic in origin (Robinson et al, 2015), and there is some evidence from variant frquencies that this is also the case for intelligence, educational attainment etc. for the world at large (Piffer, 2017 etc.). More datasets are coming, stronger methods exist (e.g. sibling comparisons, admixture mapping), and it is possible to find out what is true and what is not without strict experimental evidence. I do not share Turkheimer’s anti-scientific, defeatist attitude.

How do polygenic group differences work?

Remedial statistical genetics for Professor Turkheimer

Emil O. W. Kirkegaard