Jewish people have higher IQ on average than non-Jewish people. Has polygenic selection played a role in this difference? This is a well-known hypothesis, described here on Wikipedia, and here by Steven Pinker, but it has never been empirically tested. We investigate the question using data from the Wisconsin Longitudinal Study. This data sets contains an IQ measure as well as polygenic scores for educational attainment (edu PGS). (Educational attainment is highly correlated with IQ, and is sometimes used as a proxy for IQ).
Note that the comparison of PGS values between populations may not be an approach that works scientifically, as indicated by research such as this paper. Hopefully this question will be explored further in the coming years.
library(pacman)
p_load(tidyverse, pander, stringr, lavaan, magrittr, lavaanPlot, ggpubr, broom, pwr, simsem)
PanderOpts(digits = 3, table.split.table = Inf, missing = "-", table.alignment.default = "left")
theme_set(theme_bw(base_size=13))
A data set with PGS is prepared here. See link for details about data preparation.
read_df <- function(){
read_delim("pgs_data.csv", delim=";", col_types=cols(si001re = col_double())) %>%
select(idpub, relfml, z_relr75, PGS=pgs_std, IQ=z_iq100, birthyear=z_brdxdy,
ses57, income=z_yfam74, sexrsp=z_sexrsp, rtype)
}
df <- read_df()
Keep only Jewish people and the two major religious groups, Lutherans and Catholics
df %<>% filter(relfml %in% c("Roman Catholic", "Lutheran", "Jewish")) %>%
mutate(Religion = ifelse(relfml == "Jewish", "Jewish", "Christian"),
isJewish = Religion == "Jewish") %>%
drop_na(Religion, PGS)
df_iq <- df %>% drop_na(IQ)
df %>% select(Religion) %>% table %>% pander(caption="Sample size")| Christian | Jewish |
|---|---|
| 4630 | 53 |
plot_dist <- function(df, fill){
plot_density <- function(attr) {
ggplot(df, aes_string(x=attr, fill=fill)) +
geom_density(alpha=0.6)
}
ggarrange(plot_density("IQ"), plot_density("PGS"), ncol=2, nrow=1, common.legend=T)
}
df_iq %>% plot_dist("Religion")
df %<>% mutate(
ses57_std = as.numeric(scale(ses57)),
logses57_std = as.numeric(scale(log(ses57))))
m <- lm(PGS ~ Religion, df)
m %>% summary() %>% pander()
m %>% confint_tidy(conf.level=0.95, parm="ReligionJewish") %>%
pander(caption="95% confidence interval")| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -0.0148 | 0.0145 | -1.02 | 0.308 |
| ReligionJewish | 1.38 | 0.136 | 10.1 | 7.19e-24 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 4683 | 0.986 | 0.0214 | 0.0212 |
| conf.low | conf.high |
|---|---|
| 1.11 | 1.65 |
m <- lm(PGS ~ Religion + ses57_std + log(income) + sexrsp, df)
m %>% summary() %>% pander()
m %>% confint_tidy(conf.level=0.95, parm="ReligionJewish") %>%
pander(caption="95% confidence interval")| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -0.0317 | 0.0562 | -0.565 | 0.572 |
| ReligionJewish | 1.29 | 0.146 | 8.81 | 1.76e-18 |
| ses57_std | 0.107 | 0.0156 | 6.88 | 6.8e-12 |
| log(income) | 0.00512 | 0.0113 | 0.452 | 0.651 |
| sexrspmale | 0.00477 | 0.0312 | 0.153 | 0.878 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 4050 | 0.984 | 0.0353 | 0.0343 |
| conf.low | conf.high |
|---|---|
| 1 | 1.58 |
ggplot(df_iq, aes(x=PGS, y=IQ)) +
geom_jitter(aes(color="Christian"), alpha=0.5) +
geom_jitter(data=df_iq %>% filter(Religion == "Jewish"), aes(color="Jewish"), size=2) +
labs(color="Religion") +
theme(legend.position = c(0, 1),legend.justification = c(-0.01, 1.01)) +
scale_color_manual(values = c("blue","orangered2"))
model <- 'IQ ~ c * isJewish + b * PGS
PGS ~ a * isJewish
ab := a * b
tot := ab + c
prop_mediated := ab/tot'
fit <- sem(model, df_iq)
lavaanPlot(model=fit, node_options = list(shape = "circle", fontname = "Helvetica"),
edge_options = list(color = "grey"), coefs = T, stand=T)
summary(fit, standardized=T, header=F)##
## Parameter Estimates:
##
## Information Expected
## Information saturated (h1) model Structured
## Standard Errors Standard
##
## Regressions:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## IQ ~
## isJewish (c) 2.394 1.992 1.202 0.229 2.394 0.017
## PGS (b) 4.429 0.213 20.835 0.000 4.429 0.298
## PGS ~
## isJewish (a) 1.363 0.137 9.922 0.000 1.363 0.145
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .IQ 199.595 4.181 47.734 0.000 199.595 0.910
## .PGS 0.969 0.020 47.734 0.000 0.969 0.979
##
## Defined Parameters:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## ab 6.034 0.674 8.958 0.000 6.034 0.043
## tot 8.428 2.062 4.087 0.000 8.428 0.060
## prop_mediated 0.716 0.172 4.160 0.000 0.716 0.716
model <- 'IQ ~ isJewish
PGS ~ isJewish
ses57 ~ isJewish
ses57 ~ PGS
IQ ~ ses57
IQ ~ PGS'
fit <- sem(model, df_iq)
lavaanPlot(model=fit, node_options = list(shape = "circle", fontname = "Helvetica"),
edge_options = list(color = "grey"), coefs = T, stand=T, stars=T)
summary(fit, standardized=T, header=F)##
## Parameter Estimates:
##
## Information Expected
## Information saturated (h1) model Structured
## Standard Errors Standard
##
## Regressions:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## IQ ~
## isJewish -2.311 1.945 -1.188 0.235 -2.311 -0.017
## PGS ~
## isJewish 1.363 0.137 9.922 0.000 1.363 0.145
## ses57 ~
## isJewish 13.350 1.431 9.327 0.000 13.350 0.137
## PGS 1.062 0.153 6.951 0.000 1.062 0.102
## IQ ~
## ses57 0.352 0.020 17.672 0.000 0.352 0.246
## PGS 4.054 0.207 19.613 0.000 4.054 0.272
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .IQ 186.794 3.913 47.734 0.000 186.794 0.851
## .PGS 0.969 0.020 47.734 0.000 0.969 0.979
## .ses57 103.083 2.160 47.734 0.000 103.083 0.967
Since there is not significant different, it is justified to merge them into one group.
df_iq %>% filter(!isJewish) %>% mutate(Denomination = relfml) %>% plot_dist("Denomination")
It is possible that the Jewish people in Wisconsin are somehow not representative of the national Jewish population. There are not a lot of good available data on this, so we investigate this by looking at the income levels of federal employees.
We find public salary information from www.fedsdatacenter.com, a list of Jewish surnames form wikipedia, and a list of the largest US cities and their states from Wikipedia. The list of cities is used to assign each person a US state.
For every person we calculate the relative salary, which is the salary of that person divided by the mean salary in the location the person works at.
Finally we look at whether the relative salaries of Jewish people in Wisconsin is higher than the relative salaries in other US states.
get_relative_salaries <- function(){
cities <- read_delim("cities.csv", delim="\t", col_names=F) %>%
mutate(Location = str_replace(X2, "\\[.*\\]", ""),
state = trimws(X3)) %>%
select(Location, state)
jewish_names <- read_csv("jewish_names.txt", col_names = c("surname")) %>%
separate(surname, into="surname")
salaries <- read_delim("salaries.csv", delim=";", col_types=cols(.default="c")) %>%
separate(Name, into=c("surname", "first_name")) %>%
mutate(Salary = as.numeric(str_replace_all(Salary, c("\\$"="", ","=""))),
surname = str_to_title(surname),
Location = str_to_title(Location)) %>%
select(surname, Salary, Location) %>%
filter(Salary > 0) %>%
mutate(log_salary = log(Salary))
merge(salaries, cities, by="Location", all.y=T) %>%
mutate(State = ifelse(state=="Wisconsin", "Wisconsin", "Other US States"),
isJewish = ifelse(surname %in% jewish_names$surname, T, F)) %>% na.omit
}
#get_relative_salaries() %>% write_tsv("relative_salaries.tsv")
dfs <- read_tsv("relative_salaries.tsv")
We find that there is no difference
jewish <- dfs %>% group_by(Location) %>%
mutate(relative_salary = Salary / mean(Salary),
log_relative_salary = log_salary / mean(log_salary)) %>% filter(isJewish)
ggplot(jewish, aes(y=relative_salary, x=State)) +
geom_boxplot() + ylab("Relative salary")t.test(relative_salary ~ State, jewish)| Test statistic | df | P value | Alternative hypothesis | mean in group Other US States | mean in group Wisconsin |
|---|---|---|---|---|---|
| -0.0418 | 421 | 0.967 | two.sided | 1.09 | 1.09 |
ggplot(jewish, aes(y=log_relative_salary, x=State)) +
geom_boxplot() + ylab("log(Relative salary)")t.test(log_relative_salary ~ State, jewish)| Test statistic | df | P value | Alternative hypothesis | mean in group Other US States | mean in group Wisconsin |
|---|---|---|---|---|---|
| 0.137 | 423 | 0.891 | two.sided | 1.01 | 1.01 |
The Health and Retirement Study.
Data processing here.
Note that the cognition score is based on a test that is much less accurate than the one in WLS.
hrs <- read_tsv("df_std.tsv") %>%
mutate(Religion = case_when(RARELIG == 3 ~ "Jewish", RARELIG == 1 ~ "Christian"),
isJewish = Religion == "Jewish") %>%
filter(Religion %in% c("Jewish", "Christian")) %>%
select(Religion, Cognition = cog1_100, GENDER, PGS_EDU = PGS_EDU_std, PGS_COG = PGS_COG_std, isJewish) %>%
drop_na(isJewish, PGS_EDU, Cognition)
hrs %>% select(Religion) %>% table %>% pander(caption="Sample size")| Christian | Jewish |
|---|---|
| 6517 | 212 |
plot_dist <- function(df){
plot_density <- function(attr) {
ggplot(df, aes_string(x=attr, fill="Religion")) +
geom_density(alpha=0.6)
}
ggarrange(plot_density("Cognition"),
plot_density("PGS_EDU"),
plot_density("PGS_COG"),
ncol=3, nrow=1, common.legend=T)
}
plot_dist(hrs %>% filter(Cognition < 140, Cognition > 60))m <- lm(PGS_EDU ~ Religion, hrs)
m %>% summary() %>% pander()
m %>% confint_tidy(conf.level=0.95, parm="ReligionJewish") %>%
pander(caption="95% confidence interval")| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -0.0371 | 0.0122 | -3.04 | 0.00236 |
| ReligionJewish | 1.15 | 0.0688 | 16.7 | 1.82e-61 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 6729 | 0.985 | 0.0399 | 0.0397 |
| conf.low | conf.high |
|---|---|
| 1.01 | 1.28 |
ggplot(hrs, aes(x=PGS_EDU, y=Cognition)) +
geom_jitter(aes(color="Christian"), alpha=0.5) +
geom_jitter(data=hrs %>% filter(Religion == "Jewish"), aes(color="Jewish"), size=2) +
labs(color="Religion") +
theme(legend.position = c(0, 1),legend.justification = c(-0.01, 1.01)) +
scale_color_manual(values = c("blue","orangered3"))
model <- 'Cognition ~ c * isJewish + b * PGS_EDU
PGS_EDU ~ a * isJewish
ab := a * b
tot := ab + c
prop_mediated := ab/tot'
fit <- sem(model, hrs)
lavaanPlot(model=fit, node_options = list(shape = "circle", fontname = "Helvetica"),
edge_options = list(color = "grey"), coefs = T, stand=T)
summary(fit, standardized=T, header=F)##
## Parameter Estimates:
##
## Information Expected
## Information saturated (h1) model Structured
## Standard Errors Standard
##
## Regressions:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## Cognition ~
## isJewish (c) 0.868 0.930 0.934 0.350 0.868 0.011
## PGS_EDU (b) 2.051 0.162 12.696 0.000 2.051 0.156
## PGS_EDU ~
## isJewish (a) 1.149 0.069 16.715 0.000 1.149 0.200
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .Cognition 170.453 2.939 58.004 0.000 170.453 0.975
## .PGS_EDU 0.970 0.017 58.004 0.000 0.970 0.960
##
## Defined Parameters:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## ab 2.357 0.233 10.110 0.000 2.357 0.031
## tot 3.226 0.922 3.498 0.000 3.226 0.043
## prop_mediated 0.731 0.215 3.405 0.001 0.731 0.731
df <- read_df() %>% mutate(Religion = relfml)
df %>% drop_na(IQ) %>% group_by(Religion) %>%
summarise(n = n(), Mean_IQ = mean(IQ), sd_IQ = sd(IQ), Mean_PGS = mean(PGS), sd_PGS = sd(PGS)) %>%
filter(n > 1)| Religion | n | Mean_IQ | sd_IQ | Mean_PGS | sd_PGS |
|---|---|---|---|---|---|
| Baptist, Southern Baptist, Free Will Baptist, Primitive Baptist | 109 | 95.3 | 14.2 | -0.113 | 0.954 |
| Episcopalian, Anglican | 76 | 105 | 17 | 0.27 | 1.1 |
| Evangelical | 13 | 100 | 16.9 | 0.281 | 1.05 |
| Evangelical Free | 6 | 106 | 21.3 | -0.123 | 0.54 |
| Jehovah’s Witnesses | 9 | 97.3 | 18.9 | -0.355 | 0.926 |
| Jewish | 52 | 108 | 14.6 | 1.35 | 1.08 |
| Lutheran | 1982 | 99.6 | 14.9 | -0.0419 | 0.978 |
| Other religions, not established religion or changing religion with no clue of what they are changing to | 248 | 101 | 15.7 | -0.0305 | 1.02 |
| Pentecostal, Assembly Of God | 19 | 96.5 | 12.2 | -0.117 | 0.865 |
| Presbyterian | 211 | 103 | 15.5 | -0.025 | 0.937 |
| Protestant (No Denomination Given, Born Again, Born Again Christian, Christian, No Preference, Nothing Particular) | 43 | 100 | 17.6 | -0.0682 | 1.02 |
| Protestant, Non-Denominational (R Must Have Used The Words ‘Non-Denominational’ For Religious Preference) | 4 | 104 | 13.4 | -0.167 | 1 |
| Reformed, Dutch, Swiss, And Christian Reformed, First Reformed, Church Of America | 36 | 103 | 17.1 | 0.272 | 0.773 |
| Roman Catholic | 2523 | 99.7 | 14.7 | 0.0142 | 0.988 |
| United Church Of Christ, Congregational Evangelical And Reformed | 269 | 102 | 15.2 | 0.0911 | 1.1 |
| United Methodist, Methodist, Evangelical United Brethren, Free Methodist, Primitive Methodist, Nazarene Free Methodist | 615 | 99.7 | 15.1 | -0.0508 | 1.02 |
| - | 40 | 99.8 | 13.4 | 0.0469 | 0.85 |
53 Jewish people is not a large sample. However, this is the number that were present in the study. And small sample size is not necessarily a problem as long as the power is sufficiently good.
For instance, let’s assume that we are looking for a difference in PGS between Jewish people and Christians that is in the same level as the one usually found in IQ tests, eg around 2/3d. What would be the power to detect this difference at 0.05 significance level?
mean_Jewish <- 1.66
mean_Christian <- 1
sd <- 1
Cohen.d <- (mean_Jewish - mean_Christian)/sqrt(((sd^2) + (sd^2))/2)
tibble("WLS" = pwr.t2n.test(n1 = 53, n2 = 4630, d = Cohen.d, sig.level = 0.05)$power,
"HRS" = pwr.t2n.test(n1 = 212, n2 = 6517, d = Cohen.d, sig.level = 0.05)$power)| WLS | HRS |
|---|---|
| 0.998 | 1 |