In a post published on his website, Gwern investigates the efficiency of embryo selection. It’s impressive work. In a later revision, he added a simulation that examines the effects of selecting for genetically correlated traits.
He relies on real data about genetic correlations and has selected a set of 35 traits of interest. He finds that when one considers selection on a composite trait of the 35 traits, one can make astonishing gains (mean gain 3.64 Z) which become ever larger when one takes into account the genetic correlations between traits (5.20 Z), which tend to be favorable (i.e. positive correlations between traits we want). Unfortunately, this conclusion is based on a couple of mistakes. One can see this intuitively by considering that if one has 10 embryos to select from, it is not possible to do better than about 1/10 on average, no matter if one selects for a single or composite trait, so there must be a mistake somewhere.
The first mistake is that he assumes that the standard deviation (SD) of the composite trait is identical to that of the composite traits, i.e. 1. This is not the case. The SD of a composite variable is a non-linear function of the SDs of the constituent variables and this must be taken into account.
The second mistake is incorrect creation of the genetic correlation matrix. For instance, in Gwern’s matrix, the genetic correlation between ADHD and College education is positive at .208. However, if one looks at the source data, one can see that it should be -.397. The mistake is due to the source data.frame not having the self-self genetic correlations which then go missing as he constructs the matrix. In fact, an entire trait goes missing, since his matrix only has 34 rows/cols while the source data without duplicates has 35 (the missing trait is Years of Education).
The third mistake is a couple of errors in the reversing of the direction for some traits. For instance, the dataset he relies on reports a genetic correlation of .523 between current/former smoker and years of education, and .410 with childhood IQ. Since smoking is known to be negatively related to educational attainment and cognitive ability, they probably reverse coded the smoking variable at some point. There were some other dubious results in the dataset. For instance, autism spectrum is reported to have genetic correlation of -.130 with ADHD, despite the well-known large positive phenotypic relationship. From examining the correlations to other traits, it looks to be due to sampling error in the Autism Spectrum estimates.
A fourth mistake is that some correlations are estimated to be outside the possible range (-1 to 1) and were not winsorized.
With these points in mind, let’s redo Gwern’s analyses. First, we load the genetic correlations data:
#read data from gwern's site
rg <- read.csv("http://www.gwern.net/docs/genetics/2016-zheng-ldhub-49x49geneticcorrelation.csv", stringsAsFactors = F)
# delete redundant/overlapping/obsolete ones:
dupes <- c("BMI 2010", "Childhood Obesity", "Extreme BMI", "Obesity Class 1", "Obesity Class 2",
"Obesity Class 3", "Overweight", "Waist Circumference", "Waist-Hip Ratio", "Cigarettes per Day",
"Ever/Never Smoked", "Age at Smoking", "Extreme Height", "Height 2010")
rgClean <- rg[!(rg$Trait1 %in% dupes | rg$Trait2 %in% dupes),]
rgClean <- subset(rgClean, select=c("Trait1", "Trait2", "rg"))
#fix impossible values
rgClean$rg %<>% winsorise(upper = 1, lower = -1)
Then we reverse the genetic correlations for traits depending on which direction we want to select for the traits:
#reverse negative traits
#http://emilkirkegaard.dk/en/wp-content/uploads/trait_directions.csv
trait_directions = read.csv("trait_directions.csv", stringsAsFactors = F, row.names = 1)
#insert signs
rgClean$Trait1_sign = trait_directions[rgClean$Trait1, "Sign"]
rgClean$Trait2_sign = trait_directions[rgClean$Trait2, "Sign"]
#change rg's
rgClean$rg_turned = rgClean$rg * rgClean$Trait1_sign * rgClean$Trait2_sign
Then we reshape the data to a matrix of genetic correlations and create a matrix with null correlations:
#reshape to matrix format
rgMatrix <- acast(rgClean, Trait2 ~ Trait1, value.var = "rg_turned")
#add redundant top row and last column
rgMatrix = rbind("ADHD" = rep(NA, 34), rgMatrix)
rgMatrix = cbind(rgMatrix, "Years of Education" = rep(NA, 35))
# convert from half-matrix to full symmetric matrix
rgMatrix <- lowerUpper(t(rgMatrix), rgMatrix)
#set diagonals to 1
rgMatrix[is.na(rgMatrix)] <- 1
#genetically independent traits
independent <- matrix(ncol=35, nrow=35, 0)
diag(independent) <- 1
Before simulating data, it is worth examining the overall pattern in the genetic correlations. Since we have reversed the traits depending on which direction we want to select for (i.e. against ADHD, BMI, for IQ, education), we are interested in positive genetic correlations. We subset the lower diagonal of the matrix to avoid the 1’s in the diagonal:
#overall pattern of intercors
rgMatrix %>% MAT_get_half() %>% psych::describe(skew = F)
## vars n mean sd min max range se
## X1 1 595 0.08 0.21 -0.47 1 1.47 0.01
#plot distribution
rgMatrix %>% MAT_get_half() %>% GG_denhist() + scale_x_continuous(breaks = seq(-5, 1, .1))
The distribution of genetic correlations is slightly positive on average (r=.08), which means that selection will be somewhat faster than if the traits were independent.
Finally, we replicate Gwern’s selection analyses (rewritten to be more readable) and also save the SDs of the within batch composite trait:
set.seed(1) #reproducible results
d_batches = adply(1:10000, .margins = 1, .id = NULL, .fun = function(x) {
#simulate batch data
m_batch = rmvnorm(10, sigma=cor2cov(independent, sd=rep(sqrt(0.33*0.5), 35)), method="svd")
m_batch2 = rmvnorm(10, sigma=cor2cov(rgMatrix, sd=rep(sqrt(0.33*0.5), 35)), method="svd")
#get the composite scores
v_index_scores = rowSums(m_batch)
v_index_scores2 = rowSums(m_batch2)
#get SD and max values
data.frame("max_inde" = max(v_index_scores), "sd_inde" = sd(v_index_scores),
"max_cor" = max(v_index_scores2), "sd_cor" = sd(v_index_scores2))
})
#descriptive stats
desc = psych::describe(d_batches, skew = F); desc
## vars n mean sd min max range se
## max_inde 1 10000 3.70 1.42 -1.07 10.54 11.60 0.01
## sd_inde 2 10000 2.34 0.57 0.71 4.82 4.11 0.01
## max_cor 3 10000 7.06 2.67 -1.29 22.77 24.06 0.03
## sd_cor 4 10000 4.45 1.06 1.06 9.24 8.18 0.01
So, we can see that the we get approximately the same results. But notice that the mean SD for the composite trait assuming no genetic correlations is not 1, but about 2.3. Using this, we can see that the mean selection in Z values assuming no genetic correlations between the traits is:
desc[1, "mean"] / desc[2, "mean"]
## [1] 1.582436
pnorm(desc[1, "mean"], sd = desc[2, "mean"])
## [1] 0.943225
So, on average, we can expect to get a selection of abouot 1/20 according on the composite trait in this rough simulation. This number should have come out at about 1/10 but I suspect it’s because the SD of the batch population is incorrectly estimated by taking the mean of simulated SDs. I used two methods to better estimate this value. First, by simply simulating a very large batch and calculating the SD within it:
set.seed(1) #reproducible results
sd_large_batch = rmvnorm(10000, sigma=cor2cov(independent, sd=rep(sqrt(0.33*0.5), nrow(independent))), method="svd") %>%
rowSums() %>%
sd()
sd_large_batch
## [1] 2.403855
This results in a slightly higher value, and if we use that we get:
#how selection with corrected SD?
desc[1, "mean"] / sd_large_batch
## [1] 1.539568
pnorm(desc[1, "mean"], sd = sd_large_batch)
## [1] 0.9381672
So this is slightly closer to the expected value.
We can also derive the value one would expect had all the traits been selected in the same direction and independently. This value is:
#mathematical solution
sqrt(35 * .33*.5) #variances of uncorrelated traits are additive; SD of composite trait is sqrt of summed variance.
## [1] 2.403123
Math and experiment are in agreement :).
Finally, using this estimate of the SD of the composite, we can calculate how large our selection power is for the data where the traits are somewhat genetically correlated:
#selection from correlated traits, given current data
desc[3, "mean"] / sd_large_batch
## [1] 2.937012
pnorm(desc[3, "mean"], sd = sd_large_batch)
## [1] 0.998343
This result is implausibly large. I suspect it is because we added a large number of (semi)-duplicate traits such as College and Years of Education (genetic r=1.00), and the various cholesterol variables (LDL x Total Cholesterol, genetic r = .998). I’m not sure how to get around this problem except by being careful about which traits to include in the calculations. I don’t see a mathematical way to distinguish between trait pairs that are conceptually distinct but highly genetically correlated (cognitive ability and generalized socioeconomic factor), and trait pairs that are conceptually related and also highly genetically related (two different types of education attainment variables)