library(pacman)
p_load(pwr, knitr, ggplot2, reshape2)

# Convert an independent-samples t statistic to Cohen's d given the two group sizes
TD <- function(t, n1, n2) {
  t * sqrt((1 / n1) + (1 / n2))
}

dfi <- read.csv("RatZub.csv")

Rationale

Tabery’s (2009) review of Sesardic’s Making Sense of Heritability (2005) referenced Cooper & Zubek’s (1958) study of the effects of environments on maze running performance in rats. Tabery interpreted the result as evidence that members of the “biometric tradition” would fail to interpret a potentially meaningful G \(\times\) E interaction because of its statistical insignificance. This point is hard to appreciate because its validity requires biometricians to be ignorant when, in fact, they frequently note issues of statistical power, which can be worsened by measurement error, range restriction, and sampling error. Like candidate gene studies, most G \(\times\) E studies featured small samples with sometimes subpar phenotype measurements, making their wider lack of replication easily explicable (see, e.g., Uher, 2008 on reliability issues in G \(\times\) E studies). Tabery’s criticism is signally strange; it is as if he had to come up with at least one empirical example where Lewontinian interactionism held in the real world, but his only example was one with extreme limitations, no replications to date, and which any right-minded biometrician would note features very low power and no human subjects.

Sesardic (pp. 66-68) did note several limitations of the study, which I will present here, but which Tabery seemed to ignore:

  1. Since it is just one study, any generalization from it must obviously be very shaky. As the study has never been replicated one should not rush to draw far-reaching conclusions from it. In fact, Henderson’s experiments from the 1970s involving thousands of mice revealed little statistical interaction, which shows that the empirical evidence here does not speak with one voice.
  2. There is no good reason to expect that the results of the Cooper and Zubek experiment on rats would immediately carry over into the area of human behavior genetics, although this kind of overhasty extrapolation has been defended.
  3. As the rats belonged to two inbred strains it becomes even more questionable whether the results would be similar in normal organisms that are typically hybrid.
  4. The observed interaction is not of a radical (non-ordinal) type, i.e., the norms of reaction do not cross. But it is non-ordinal interactions that most strongly undermine the attribution of main effects, and consequently the causal import of heritability claims as well. It is interesting to note here that Lewontin, with his insistence on the pervasiveness of non-ordinal interactions, seriously misrepresents Cooper and Zubek’s study (although he doesn’t mention the authors by name):

strains of rats can be selected for better or poorer ability to find their way through a maze, and these strains of rats pass on their differential ability to run the maze to their offspring, so they are certainly genetically different in this respect. But if exactly the same strains of rats are given a different task, or if the conditions of learning are changed, the bright rats turn out to be dull and the dull rats turn out to be bright.

The italicized phrase seems to suggest that the two genetically different strains of rats switch their positions on the bright-dull scale, but this actually never happened.

  5. Cooper and Zubek themselves warned about methodological limitations of their experimental set-up. They suggested that the two strains of rats might have actually differed in their real learning ability even in those situations where their performance was indistinguishable. It may have been, they said, that the ceiling of the test was simply too low to differentiate the animals. Moreover, they mentioned that something similar really happened with some tests of human intelligence “on which adults of varying ability may achieve similar I.Q. scores although more difficult tests reveal clear differences between them”. So there is a kind of vicious circle here. While critics of human behavior genetics often use Cooper and Zubek’s experiment to raise methodological objections against research on IQ, unbeknownst to these critics, it is precisely the research on IQ that led the authors of that study to warn the reader that the results of their experiment should not be overinterpreted.
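The ceiling concern in the last point can be illustrated with a small simulation. This is only a sketch under invented assumptions: normally distributed latent scores, a true bright-dull difference of one standard deviation, and a hypothetical test ceiling. None of these values come from Cooper & Zubek, and the helper `cohens_d` is defined here purely for illustration.

```r
set.seed(1958)

# Cohen's d with a pooled standard deviation
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  sp <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / sp
}

# Hypothetical latent abilities: "bright" exceeds "dull" by 1 SD (assumed)
n      <- 5000
bright <- rnorm(n, mean = 1, sd = 1)
dull   <- rnorm(n, mean = 0, sd = 1)

# Hypothetical ceiling: the test cannot register ability above 0.5
ceiling_score <- 0.5
bright_obs <- pmin(bright, ceiling_score)
dull_obs   <- pmin(dull, ceiling_score)

d_latent   <- cohens_d(bright, dull)
d_observed <- cohens_d(bright_obs, dull_obs)
cat("latent d:", round(d_latent, 2),
    " observed d under ceiling:", round(d_observed, 2), "\n")
```

Under these assumptions the observed difference shrinks even though the latent difference is unchanged, which is the sense in which a low ceiling could make two strains look more alike than they are.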

In other words, there are serious concerns about the ability to interpret the result in support of either of Tabery’s contrasted paradigms. To allow interpretation we would need to overcome issues with measurement reliability and range restriction, contradictory results from other experiments (stated differently, since other studies fail to support interactions, in the absence of clear evidence for an interaction with the phenotype measured by Cooper & Zubek, there should be some reason to consider this interaction uniquely real), generalizability concerns due to inbreeding (this is common with drug testing; see Brodie, 1962; Lasagna, 1964), the applicability of the result to claims that differences can be reversed at all, and, beyond these criticisms, issues of power resulting from said reliability issues, range restriction, and small sample sizes. A further, related concern is invariance. If the manifest result is interpreted in terms of “learning ability” or “spatial memory”, it would require invariance with respect to said ability across conditions; however, in humans, when range restriction occurs, a phenomenon called Spearman’s Law of Diminishing Returns is often observed. When this happens, the indicators can no longer be interpreted as having the same relationship to the underlying ability as in conditions where range is unrestricted. Unfortunately, the standard deviations - which should be smaller in the extreme conditions - were not provided (and they were likely uneven, a complicating influence that I will not touch on here). If we had the p values for the bright-dull comparisons in each environmental group, we could compute them, but the best we can compute is the d for several of the conditions. To obtain that, we would use the formula

\[d = t\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}\]
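For readers who want the intermediate step, the conversion follows from the usual definitions, assuming an equal-variance independent-samples t test with pooled standard deviation \(s_p\) and group means \(M_1\) and \(M_2\) (symbols introduced here for illustration; they are not reported in the study):

\[t = \frac{M_1 - M_2}{s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}, \qquad d = \frac{M_1 - M_2}{s_p} \quad\Rightarrow\quad d = t\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}\]

As a worked instance, \(t = 0.715\) with \(n_1 = 11\) and \(n_2 = 12\) gives \(d \approx 0.715 \times 0.417 \approx 0.298\).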

These are really only useful for getting some idea of the empirical values obtained in the study since post-hoc power analysis really just restates the p value. A weird thing worth noting is that Tabery’s commentary on this study erred. For one, there are two significant effects in theoretically interesting directions, thus threatening the interpretation that “[A] member of the biometric tradition could ignore the G \(\times\) E as too small to consider”, if we assume hereditarians are only interested in mere significance in the first place. Secondly, Tabery uses p = 0.5 (not a typo) to comment that the “interaction effect appears to be not significant”, which could signal that he made an error or that he got the value from the study (p. 161), in which case, an ANOVA was not conducted (as he said it was), and this would not be relevant since it would not show an insignificant interaction. If he performed an ANOVA on the data and reached a similar, independent result, that’s different and the result is curious. Another thing worth noting is that the Henderson result showing relatively few interactions seems to replicate in humans (Hill, Goddard & Visscher, 2008). Moreover, there are many findings which fail to transfer from mouse studies to people (Shanks, Greek & Greek, 2009; Bracken, 2009) and apparent G \(\times\) E within groups do not seem to lead to their expected effects between groups (e.g., the Scarr-Rowe effect).
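Returning to the remark that post-hoc power restates the p value: the sketch below, using only base R, shows that for fixed group sizes both quantities are functions of the same t statistic. It assumes an equal-variance, two-sided, two-sample t test; the function names `observed_power` and `two_sided_p` are made up for this illustration, and the group sizes are simply borrowed from one of the comparisons.

```r
# Observed (post-hoc) power computed from the t statistic alone.
# Plugging the observed d back into the power formula makes the
# noncentrality parameter equal to |t| itself.
observed_power <- function(t, n1, n2, alpha = 0.05) {
  df   <- n1 + n2 - 2
  crit <- qt(1 - alpha / 2, df)
  ncp  <- abs(t)
  pt(-crit, df, ncp) + pt(crit, df, ncp, lower.tail = FALSE)
}

# Two-sided p value for the same t statistic
two_sided_p <- function(t, n1, n2) {
  2 * pt(abs(t), n1 + n2 - 2, lower.tail = FALSE)
}

t_grid <- seq(0.1, 4, by = 0.1)
p_vals <- two_sided_p(t_grid, 11, 12)
powers <- observed_power(t_grid, 11, 12)

# Each p value maps to exactly one observed power and vice versa
print(head(data.frame(t = t_grid, p = round(p_vals, 3), power = round(powers, 3))))
```

Because p falls and observed power rises monotonically in |t|, knowing one determines the other; a result sitting right at p = 0.05 has observed power close to one half.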

There is a final issue with Tabery’s review. In an earlier review (Tabery, 2006), he wrote

Readers will find Chapter 6 to be the most controversial, where Sesardic queries the relationship between politics and heritability research. Can cognitive errors found in heritability research be attributed to political motivation? (Sesardic: Maybe, but this would require evidence that critics have never provided [pp. 186-193].) Does egalitarianism survive the results found in heritability research? (Sesardic: Not without severe modification [pp. 214-126].) Is it reasonable to racially profile individuals? (Sesardic: Yes, based on a Bayesian analysis [pp. 223-224].) If a black person and a white person both take an IQ test and receive a score of 93, what should we assume about their true IQ scores? (Sesardic: Since the average IQ score for black populations is 85, and the average IQ score for white populations is 100, ascribe a lower IQ to the black person and a higher IQ to the white person based on regression to the mean [p. 225].)

And in 2009, he wrote

In Chap. 6 Sesardic takes on the controversial political implications of his defense of the causal information found in heritability measures. Can cognitive errors found in heritability research be attributed to political motivation? (Sesardic: Maybe, but this would require evidence that critics have never provided (pp. 186-193)). Does egalitarianism survive the results found in heritability research? (Sesardic: Not without modification (pp. 214-216).) Is it reasonable to racially profile individuals? (Sesardic: Yes, based on a Bayesian analysis (pp. 220-224).) If a black person and a white person both take an IQ test and receive a score of 93, what should we assume about their true IQ scores? (Sesardic: Since the average IQ score for white populations is 100, and the average IQ score for black populations is 85, assume the black person actually has a lower IQ than the white person (p. 225).) Throughout the chapter, Sesardic depicts himself as the stoic surveyor of “hard fact[s]” (p. 223), who is willing to look “into the abyss” (p. 212).

Besides his pagination error in the prior publication, he plagiarized the majority of those remarks without attribution. I have put the plagiarized sections in bold. I do not know why Tabery was allowed to do that. I have always been told that any amount of plagiarism is condemnable.

Analysis

kable(dfi[, 2:5])
|Line   | Restricted| Normal| Enriched|
|:------|----------:|------:|--------:|
|Bright |         13|     11|       12|
|Dull   |          9|     11|        9|
#Bright Enriched-Normal

BEN = TD(0.715, 11, 12)

#Dull Enriched-Normal

DEN = TD(2.52, 11, 9)

#Bright Restricted-Normal

BRN = TD(4.06, 11, 13)

#Dull Restricted-Normal

DRN = TD(0.280, 11, 9)

#Enriched Dull-Bright

EDB = TD(0.819, 12, 9)

#Post-hoc power

pwr.t.test(d = BEN, n = 11.5, sig.level = 0.05, type = "two.sample", alternative = "two.sided")
## 
##      Two-sample t test power calculation 
## 
##               n = 11.5
##               d = 0.2984578
##       sig.level = 0.05
##           power = 0.1049879
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
pwr.t.test(d = DEN, n = 10, sig.level = 0.05, type = "two.sample", alternative = "two.sided")
## 
##      Two-sample t test power calculation 
## 
##               n = 10
##               d = 1.132656
##       sig.level = 0.05
##           power = 0.6685924
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
pwr.t.test(d = BRN, n = 12, sig.level = 0.05, type = "two.sample", alternative = "two.sided")
## 
##      Two-sample t test power calculation 
## 
##               n = 12
##               d = 1.663273
##       sig.level = 0.05
##           power = 0.9732386
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
pwr.t.test(d = DRN, n = 10, sig.level = 0.05, type = "two.sample", alternative = "two.sided")
## 
##      Two-sample t test power calculation 
## 
##               n = 10
##               d = 0.1258506
##       sig.level = 0.05
##           power = 0.05818877
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
pwr.t.test(d = EDB, n = 10.5, sig.level = 0.05, type = "two.sample", alternative = "two.sided")
## 
##      Two-sample t test power calculation 
## 
##               n = 10.5
##               d = 0.3611451
##       sig.level = 0.05
##           power = 0.1232779
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
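The calls above pass the average of the two cell sizes as n. As a cross-check, the same quantity can be computed with the exact, unequal group sizes from the noncentral t distribution in base R. This is a sketch assuming an equal-variance, two-sided test; the helper name `power_unequal_n` is hypothetical. The pwr package also offers `pwr.t2n.test()`, which takes `n1` and `n2` separately.

```r
# Power of a two-sided, equal-variance two-sample t test with unequal n
power_unequal_n <- function(d, n1, n2, alpha = 0.05) {
  df   <- n1 + n2 - 2
  ncp  <- d / sqrt(1 / n1 + 1 / n2)   # noncentrality parameter
  crit <- qt(1 - alpha / 2, df)
  pt(crit, df, ncp, lower.tail = FALSE) + pt(-crit, df, ncp)
}

# t statistics and cell sizes exactly as passed to TD() above
comparisons <- data.frame(
  label = c("BEN", "DEN", "BRN", "DRN", "EDB"),
  t     = c(0.715, 2.52, 4.06, 0.280, 0.819),
  n1    = c(11, 11, 11, 11, 12),
  n2    = c(12, 9, 13, 9, 9)
)
comparisons$d     <- with(comparisons, t * sqrt(1 / n1 + 1 / n2))
comparisons$power <- with(comparisons, power_unequal_n(d, n1, n2))
print(comparisons)
```

The d column should match the values printed above; the power column may differ slightly from the averaged-n results because the degrees of freedom and noncentrality use the actual cell sizes.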
#Minimum effect size

pwr.t.test(n = 10.8, power = 0.80, sig.level = 0.05, type = "two.sample", alternative = "two.sided")
## 
##      Two-sample t test power calculation 
## 
##               n = 10.8
##               d = 1.268879
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

The mean n was 10.8 per cell (65 rats across six cells), or about 21.7 per two-group comparison.

spout <- pwr.t.test(d = 0.1, n = seq(5, 200, 1), sig.level = 0.05, type = "two.sample", alternative = "two.sided")

mpout <- pwr.t.test(d = 0.5, n = seq(5, 200, 1), sig.level = 0.05, type = "two.sample", alternative = "two.sided")

lpout <- pwr.t.test(d = 0.8, n = seq(5, 200, 1), sig.level = 0.05, type = "two.sample", alternative = "two.sided")

dfpow <- data.frame("Small" = spout$power, "Medium" = mpout$power, "Large" = lpout$power, "N" = spout$n)
ggplot(dfpow, aes(x = N)) + 
  geom_line(aes(y = Small), color = "darkred", size = 1) + 
  geom_line(aes(y = Medium), color = "steelblue", linetype = "twodash", size = 1) + 
  geom_line(aes(y = Large), color = "orangered", linetype = "dashed", size = 1) + theme_minimal() + ylab("Power") + xlab("N (Each Group)") +
  geom_vline(xintercept = 10.8, linetype = "dashed", color = "steelblue", size = 0.5)

# Per-group n required for 80% power at each effect size
# (avoids growing a matrix in a loop and shadowing grDevices::pdf)
d_grid <- seq(0.05, 1.50, 0.001)
n_req <- sapply(d_grid, function(i) pwr.t.test(d = i, power = 0.8, sig.level = 0.05, type = "two.sample", alternative = "two.sided")$n)
pp <- data.frame("D" = d_grid, "N" = n_req)
ggplot(pp, aes(x = D)) + 
  geom_line(aes(y = N), color = "darkred", size = 1) + theme_minimal() + xlab("Effect Size (d)") + ylab("Significant N (Each Group; p = 0.05)") + 
  geom_hline(yintercept = 10.8, linetype = "dashed", color = "steelblue", size = 0.5)

ggplot(pp, aes(x = N)) + 
  geom_line(aes(y = D), color = "darkred", size = 1) + theme_minimal() + xlab("Significant N (Each Group; p = 0.05)") + ylab("Effect Size (d)") + 
  geom_vline(xintercept = 10.8, linetype = "dashed", color = "steelblue", size = 0.5)
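The Rationale argued that unreliable measurement lowers power. One hedged way to see the size of that effect is the classical attenuation relation, under which an observed standardized difference equals the true one multiplied by the square root of the measure's reliability. The true d and the reliabilities below are hypothetical values chosen for illustration, not estimates from Cooper & Zubek, and the sketch uses `power.t.test()` from base R's stats package rather than pwr.

```r
# Hypothetical inputs (not from Cooper & Zubek)
d_true        <- 0.8
reliabilities <- c(1.0, 0.8, 0.6, 0.4)

# Classical attenuation: error variance inflates the SD, shrinking d
d_observed <- d_true * sqrt(reliabilities)

# Per-group n needed for 80% power at alpha = 0.05 (two-sided, two-sample)
n_needed <- sapply(d_observed, function(d) {
  power.t.test(delta = d, sd = 1, power = 0.80, sig.level = 0.05,
               type = "two.sample", alternative = "two.sided")$n
})

attenuation <- data.frame(reliability = reliabilities,
                          d_observed  = round(d_observed, 3),
                          n_per_group = ceiling(n_needed))
print(attenuation)
```

Under these assumptions the required sample grows as reliability falls, so a study with roughly eleven animals per cell loses power quickly if the phenotype is measured with error.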

Discussion

Overall, it looks like Cooper & Zubek’s study was extremely underpowered. There were significant effects that may have resulted from the sort of measurement failure Cooper & Zubek (but not Tabery) discussed, namely a test ceiling too low to differentiate the animals, meaning that invariance was not achieved with the measures in the atypical enriched and restricted conditions. Replication is dearly deserved. With luck, I will be able to present that result within a few years’ time.

What I think Tabery was doing in his review was trying to score a rhetorical point rather than making an earnest scientific argument. If he was writing sincerely then it is difficult to consider him a capable individual. The rhetorical point works like this: When confronted with a cogent point based on empirical evidence, a rebuke has to carry its own empirical weight; as such, Tabery elected to provide what could be construed as empirical evidence in his favor. However, the applicability of this proof collapses when inspected. If confronted with alternative qualification or evidence against the sole empirical basis for a theory, the idea is to fall back to theory and talk about the possible rather than the actuality that the empirical evidence would have represented. The presentation of a weak empirical work, lonesome as it is, is intended to lend credibility to outlandish theory. It’s as if he were asking researchers to accept power posing based on a series of analyses in which there was clear evidence for publication bias and other QRPs surrounding the results: it lacks all credibility. The lack of interpretability is all that can be gathered from this result. The inattention to Sesardic’s qualification of the Cooper & Zubek result is a glaring omission for those who read both publications, but perhaps if someone only reads Tabery’s review and avoids the source material and its citations, they could walk away believing the result was presented with propriety, or moreover, that Sesardic did not himself graph it (see his p. 66). As one might gather from reading Sesardic, presenting evidence and being walked back to theory as motivation for one’s beliefs is very common outside of hereditarian circles.

Relatedly, I have looked for interactions. In the postscript at https://rpubs.com/JLLJ/CIBG, I found what might have been evidence for a minor interaction such that the E factor in the ACE model of one of the studies I looked at was actually partially an E \(\times\) G factor. I cannot really speak to the appropriateness of that result, however, since it came from a small study and I believe it might have been impacted by the quality of the manifest indicator variables. Whether its failure to replicate stemmed from content is something I would like to explore later, but which I cannot test at present because open twin data is still a pipe dream. Hoping to help alleviate that issue, I have provided all of my data and code for that analysis at the link above. Something worth noting is that the models presented at that link are not something Tabery would consider valid, as stated at the outset of his 2014 book. There is presumably no principled reason for this because with minor qualification the method is more than adequate for delivering empirical evidence for “genes for” (as well as “environments for”) a latent trait in a substantive sense by showing that they are concordant with the proportionality constraints of the latent variable. Similar modeling is conducted with molecular data. This Pearlian outlook on causality is common in econometrics and is the basis for the variance partitioning procedures in behavior genetics.

A final thing I want to say is that there is an elementary error in the presentation of Jensen’s view, especially in phrases like “Jensen’s genetic hypothesis”. Jensen’s hypothesis was that between-group differences were a subset of within-group differences. This is basically to say that Jensen believed that measurement invariance was tenable and that the same degree of genetic and environmental influence appeared for cognitive measures and constructs in different groups. He also tacitly accepted that if there were group-specific factors, these would not be unfalsifiable/pseudoscientific factors that only affected factor means and acted entirely homogeneously and constantly within different races. This may be why he asked colleagues to review all of the evidence on group differences with modern methods like multi-group confirmatory factor analysis prior to his unfortunate death. In whatever case, it is a bizarre irony to typecast Jensen as being interested in variances more than he was in causal explanation. He wrote a book and papers on looking for causes and what amounted to cutting backdoor paths! The limit to his (and Plomin’s + others’) engaging in the “developmental” approach was always merely technological, never philosophical (see Sesardic, 2015).

References

Tabery, J. (2009). Making Sense of the Nature-Nurture Debate. Biology & Philosophy, 24(5), 711-723. https://doi.org/10.1007/s10539-009-9152-3

Sesardic, N. (2005). Making Sense of Heritability. Cambridge University Press. https://doi.org/10.1017/CBO9780511487378

Cooper, R. M., & Zubek, J. P. (1958). Effects of enriched and restricted early environments on the learning ability of bright and dull rats. Canadian Journal of Psychology/Revue Canadienne de Psychologie, 12(3), 159-164. https://doi.org/10.1037/h0083747

Uher, R. (2008). Gene-Environment Interaction: Overcoming Methodological Challenges. In Genetic Effects on Environmental Vulnerability to Disease (pp. 13-30). John Wiley & Sons, Ltd. https://doi.org/10.1002/9780470696781.ch2

Brodie, B. B. (1962). Part VI. Difficulties in extrapolating data on metabolism of drugs from animal to man. Clinical Pharmacology & Therapeutics, 3(3), 374-380. https://doi.org/10.1002/cpt196233374

Lasagna, L. (1964). The Diseases Drugs Cause. Perspectives in Biology and Medicine, 7(4), 457-470. https://doi.org/10.1353/pbm.1964.0015

Hill, W. G., Goddard, M. E., & Visscher, P. M. (2008). Data and Theory Point to Mainly Additive Genetic Variance for Complex Traits. PLOS Genetics, 4(2), e1000008. https://doi.org/10.1371/journal.pgen.1000008

Shanks, N., Greek, R., & Greek, J. (2009). Are animal models predictive for humans? Philosophy, Ethics, and Humanities in Medicine, 4, 2. https://doi.org/10.1186/1747-5341-4-2

Bracken, M. B. (2009). Why animal studies are often poor predictors of human reactions to exposure. Journal of the Royal Society of Medicine, 102(3), 120-122. https://doi.org/10.1258/jrsm.2008.08k033

Tabery, J. (2006). Fueling the (in)famous fire. Metascience, 15(3), 607-611. https://doi.org/10.1007/s11016-006-9059-4

Sesardic, N. (2015). Crossing the ‘explanatory divide’: A bridge to nowhere? International Journal of Epidemiology, 44(4), 1124-1127. https://doi.org/10.1093/ije/dyv055