We treat the world as a migration matrix of performance: rows = origin countries, columns = destination countries, each cell = mean PISA math of that origin’s children living in that destination (pooled cycles 2003–2022, students whose two parents share an origin). PISA scores are also shown on the IQ scale (500 → 100, SD 90, so 6 PISA points = 1 IQ point).
Due to sample sizes, privacy suppression of origin country, unequal population sizes, and differential migration, only a tiny fraction of origin × destination pairs are populated.
allO <- sort(unique(coverage_counts$origin)); allD <- sort(unique(coverage_counts$dest))
grid <- expand_grid(origin=allO, destination=allD) |>
left_join(coverage_counts, by=c("origin"="origin","destination"="dest")) |>
mutate(filled = !is.na(n) & n>=25)
ordO <- grid |> group_by(origin) |> summarise(s=sum(filled)) |> arrange(desc(s)) |> pull(origin)
ordD <- grid |> group_by(destination) |> summarise(s=sum(filled)) |> arrange(desc(s)) |> pull(destination)
grid <- grid |> mutate(origin=factor(origin,levels=rev(ordO)), destination=factor(destination,levels=ordD))
pct <- round(100*sum(grid$filled)/nrow(grid),1)
ggplot(grid, aes(destination, origin, fill=filled)) + geom_tile() +
scale_fill_manual(values=c(`FALSE`="firebrick3",`TRUE`="forestgreen"), labels=c("no data (n<25)","n ≥ 25"), name=NULL) +
labs(title=sprintf("Origin × destination coverage: %d of %d cells (%.1f%%)", sum(grid$filled), nrow(grid), pct),
x="Destination (most data → least)", y="Origin (most data → least)") +
theme_minimal() + theme(axis.text=element_blank(), axis.ticks=element_blank(), panel.grid=element_blank(), legend.position="bottom")
Some example rows (PISA math by destination, in order of sample size):
ex_tab <- function(x) x |> mutate(IQ=toIQ(math)) |> rename(Destination=dest, N=n, Math=math)
knitr::kable(ex_tab(examples$Turkey), caption="Turks")
| Destination | N | Math | IQ |
|---|---|---|---|
| AUT | 1478 | 413 | 85.5 |
| DEU | 1286 | 427 | 87.8 |
| DNK | 1285 | 424 | 87.3 |
| CHE | 959 | 447 | 91.2 |
| BEL | 858 | 430 | 88.3 |
| NLD | 348 | 460 | 93.3 |
| FIN | 111 | 416 | 86.0 |
| TUR | 64 | 395 | 82.5 |
| LIE | 46 | 471 | 95.2 |
knitr::kable(ex_tab(examples$India), caption="Indians")
| Destination | N | Math | IQ |
|---|---|---|---|
| AUS | 1114 | 549 | 108.2 |
| CAN | 883 | 515 | 102.5 |
| GBR | 88 | 510 | 101.7 |
| BRN | 80 | 509 | 101.5 |
| NZL | 37 | 527 | 104.5 |
knitr::kable(ex_tab(examples$China), caption="Chinese")
| Destination | N | Math | IQ |
|---|---|---|---|
| MAC | 17209 | 542 | 107.0 |
| HKG | 11797 | 549 | 108.2 |
| AUS | 1258 | 583 | 113.8 |
| CAN | 878 | 603 | 117.2 |
| NZL | 794 | 577 | 112.8 |
| KAZ | 116 | 433 | 88.8 |
| FIN | 68 | 488 | 98.0 |
| PAN | 60 | 401 | 83.5 |
| NLD | 42 | 519 | 103.2 |
| PRT | 41 | 507 | 101.2 |
The strict global-hereditarian model says migrants perform at their origin’s level wherever they go, so we can predict diaspora performance from origin national IQ. Countries that send migrants but don’t run PISA (red) get a home value imputed from NIQ.
od <- origin_level |> mutate(grp=ifelse(source=="PISA","has PISA home","NIQ-imputed"))
r <- round(cor(od$imm, od$NIQ, use="complete"),2)
ggplot(od, aes(NIQ, imm)) + geom_smooth(method="lm", se=FALSE, color="grey70") +
geom_point(aes(size=N, color=grp), alpha=.7) +
geom_text(aes(label=iso, color=grp), size=2.6, vjust=-0.8, check_overlap=TRUE, show.legend=FALSE) +
scale_color_manual(values=c("has PISA home"="grey25","NIQ-imputed"="firebrick"), name=NULL) +
labs(title=sprintf("Diaspora competence vs origin NIQ (r = %.2f)", r),
x="Origin NIQ", y="Diaspora mean (PISA composite)", size="migrant N") + theme_minimal() + theme(legend.position="bottom")
India is the big outlier — Indians abroad far outperform Indians at home, hinting migrants aren’t representative.
score ~ origin_NIQ + destination_NIQ + year (standardized betas; both predictors on the IQ scale):
wsd <- function(x,w) sqrt(sum(w*(x-weighted.mean(x,w))^2)/sum(w)); z <- function(x,w) (x-weighted.mean(x,w))/wsd(x,w)
i3 <- imm3 |> mutate(zm=z(math,w), zo=z(oN,w), zd=z(dN,w))
ms <- list(`(1) Origin`=lm(zm~zo+factor(year),i3,weights=w), `(2) Destination`=lm(zm~zd+factor(year),i3,weights=w),
`(3) Both`=lm(zm~zo+zd+factor(year),i3,weights=w))
modelsummary(ms, output="flextable", coef_map=c(zo="Origin NIQ (std β)", zd="Destination NIQ (std β)"),
coef_omit="Inter|year", gof_map=tibble::tribble(~raw,~clean,~fmt,"adj.r.squared","Adj. R²",3),
stars=c(`*`=.05,`**`=.01,`***`=.001), notes="Individual-level, W_FSTUWT-weighted, + year FE. n = 134,999.") |> ft_style()
| (1) Origin | (2) Destination | (3) Both |
|---|---|---|---|
Origin NIQ (std β) | 0.253*** | 0.088*** | |
(0.003) | (0.003) | ||
Destination NIQ (std β) | 0.480*** | 0.451*** | |
(0.003) | (0.003) | ||
Adj. R² | 0.067 | 0.221 | 0.227 |
* p < 0.05, ** p < 0.01, *** p < 0.001 | |||
Individual-level, W_FSTUWT-weighted, + year FE. n = 134,999. | |||
Destination NIQ is the far stronger predictor. The same holds when origin (dosage) and destination (dummies) are left free — comparing block partial R²:
oc <- oc_ref
R2d <- function(r) summary(lm(reformulate(r,"math"),dat3,weights=dat3$w))$r.squared
rf<-R2d(c(oc,"destination","factor(year)")); ro<-R2d(c(oc,"factor(year)")); rd<-R2d(c("destination","factor(year)")); ry<-R2d("factor(year)")
R2n <- function(r) summary(lm(reformulate(r,"math"),imm3,weights=imm3$w))$r.squared
nb<-R2n(c("oN","dN","factor(year)")); no<-R2n(c("oN","factor(year)")); nd<-R2n(c("dN","factor(year)"))
part <- data.frame(Model=c("Dosage + dummies","NIQ (single index)"),
`Origin only`=c(ro,no), `Dest only`=c(rd,nd), Both=c(rf,nb),
`Origin partial`=c(rf-rd,nb-nd), `Dest partial`=c(rf-ro,nb-no), check.names=FALSE)
flextable(part) |> colformat_double(j=2:6, digits=3) |> bold(j=c("Origin partial","Dest partial"), part="body") |> ft_style()
Model | Origin only | Dest only | Both | Origin partial | Dest partial |
|---|---|---|---|---|---|
Dosage + dummies | 0.228 | 0.279 | 0.320 | 0.041 | 0.091 |
NIQ (single index) | 0.067 | 0.221 | 0.227 | 0.006 | 0.160 |
Destination winning is explained by migrant selectivity. From the IAB Brain-Drain database we build a selection index = logit(high-educated % among emigrants) − logit(high-educated % of the origin population), and correlate it with the origin’s residual from NIQ:
r_no <- round(cov.wt(cbind(m$sel_logit,m$resid_noDest), wt=m$N, cor=TRUE)$cor[1,2],2)
r_w <- round(cov.wt(cbind(m$sel_logit,m$resid_wDest), wt=m$N, cor=TRUE)$cor[1,2],2)
ml <- m |> select(iso,N,sel_logit,resid_noDest,resid_wDest) |>
pivot_longer(c(resid_noDest,resid_wDest), names_to="type", values_to="resid") |>
mutate(panel=ifelse(type=="resid_noDest", sprintf("No destination control (r=%.2f)",r_no), sprintf("Destination-adjusted (r=%.2f)",r_w)))
ggplot(ml, aes(sel_logit, resid)) + geom_hline(yintercept=0,color="grey85") +
geom_smooth(aes(weight=N), method="lm", se=FALSE, color="steelblue") +
geom_point(aes(size=N), alpha=.6) + geom_text(aes(label=iso), size=2.4, vjust=-0.7, check_overlap=TRUE) +
facet_wrap(~panel) + scale_size_area(max_size=9) +
labs(title="Educational selection predicts the origin residual from NIQ",
x="Educational selection (logit gap, IAB)", y="Origin residual from NIQ (PISA pts)", size="migrant N") +
theme_minimal() + theme(legend.position="bottom")
Computing the selection index at the origin × destination level (it varies by corridor, hence not collinear with origin NIQ) lets us combine everything. Outcome on the IQ scale; all 7 subset models:
mods <- list(
`O`=lm(mathIQ~oNIQ,cm4,weights=n), `D`=lm(mathIQ~dNIQ,cm4,weights=n), `S`=lm(mathIQ~corr_logit,cm4,weights=n),
`O+D`=lm(mathIQ~oNIQ+dNIQ,cm4,weights=n), `O+S`=lm(mathIQ~oNIQ+corr_logit,cm4,weights=n),
`D+S`=lm(mathIQ~dNIQ+corr_logit,cm4,weights=n), `O+D+S`=lm(mathIQ~oNIQ+dNIQ+corr_logit,cm4,weights=n))
modelsummary(mods, output="flextable",
coef_map=c(`(Intercept)`="Intercept", oNIQ="Origin NIQ (O)", dNIQ="Destination NIQ (D)", corr_logit="Selection (S)"),
gof_map=tibble::tribble(~raw,~clean,~fmt,"adj.r.squared","Adj. R²",3), stars=c(`*`=.05,`**`=.01,`***`=.001),
notes=sprintf("Corridor-cell OLS, n-weighted, n = %d corridors (constant). O & D in IQ/IQ; S in IQ/logit. Predicted IQ = Intercept + O·oNIQ + D·dNIQ + S·selection.", nrow(cm4))) |> ft_style()
| O | D | S | O+D | O+S | D+S | O+D+S |
|---|---|---|---|---|---|---|---|
Intercept | 65.446*** | -85.871* | 101.867*** | -111.473** | 59.919*** | -12.791 | -35.500 |
(8.306) | (41.194) | (0.977) | (38.845) | (5.262) | (34.175) | (26.253) | |
Origin NIQ (O) | 0.341*** | 0.330*** | 0.473*** | 0.457*** | |||
(0.091) | (0.083) | (0.059) | (0.055) | ||||
Destination NIQ (D) | 1.855*** | 1.812*** | 1.162** | 0.982*** | |||
(0.419) | (0.390) | (0.346) | (0.265) | ||||
Selection (S) | 4.008*** | 4.583*** | 3.580*** | 4.201*** | |||
(0.484) | (0.381) | (0.477) | (0.372) | ||||
Adj. R² | 0.118 | 0.162 | 0.413 | 0.275 | 0.649 | 0.470 | 0.691 |
* p < 0.05, ** p < 0.01, *** p < 0.001 | |||||||
Corridor-cell OLS, n-weighted, n = 97 corridors (constant). O & D in IQ/IQ; S in IQ/logit. Predicted IQ = Intercept + O·oNIQ + D·dNIQ + S·selection. | |||||||
O+S (origin NIQ modified by selection) reaches almost the full model. Its fit, corridor by corridor:
cm4$fit <- predict(mods[["O+S"]])
ggplot(cm4, aes(fit, mathIQ)) + geom_abline(slope=1,intercept=0,linetype=2,color="grey70") +
geom_point(aes(size=n), alpha=.45, color="steelblue") +
geom_text(aes(label=paste0(origin_iso,"→",destination_iso)), size=2.1, vjust=-0.6, check_overlap=TRUE) +
scale_size_area(max_size=10) +
labs(title="O+S model: fitted vs observed corridor IQ", x="Fitted IQ (origin NIQ + selection)", y="Observed corridor IQ", size="migrant N") +
theme_minimal() + theme(legend.position="bottom")
usd <- examples$USA |> mutate(IQ=toIQ(math)) |> rename(Destination=dest, N=n, Math=math)
knitr::kable(usd, caption="US-origin by destination (Mexico = returnees; inverse-variance weighting of the homogeneous low Mexico cell pulled the diaspora mean to ~440).")
| Destination | N | Math | IQ |
|---|---|---|---|
| USA | 742 | 462 | 93.7 |
PISA gives no ethnicity, but it gives home language. Filipinos abroad speak English at home far more than other Asian origins — but only ~7% do inside the Philippines, so it reflects selection + the Philippines’ English-medium schooling, not ethnicity:
knitr::kable(english_by_origin, caption="English at home (%) by origin, immigrants in the same destinations")
| origin | n | pct_English |
|---|---|---|
| Philippines | 962 | 56 |
| India | 880 | 38 |
| China | 5002 | 4 |
| Turkey | 721 | 1 |
knitr::kable(phl_homeland_lang, caption="Home language inside the Philippines (PISA 2022) — English only 6.9%")
| lang | pct |
|---|---|
| Tagalog | 42.8 |
| Sinugbuanong Binisaya/Cebuano | 11.9 |
| Another language (PHL) | 10.0 |
| English | 6.9 |
| Iloko | 5.7 |
| Bikol | 4.6 |
| Hiligaynon | 4.3 |
| Waray | 2.5 |
Malaysians are the opposite — an ethnic selection (Chinese/Indian minorities leaving under bumiputera policy): 39% speak Chinese at home (vs ~23% of Malaysia) and only 21% Malay (vs ~70%):
knitr::kable(mys_lang |> select(language=lang, n, pct), caption="Malaysian-origin home language")
| language | n | pct |
|---|---|---|
| Chinese | 72 | 36 |
| English | 64 | 32 |
| Malay | 42 | 21 |
| Another language (BRN) | 14 | 7 |
| Mandarin | 5 | 2 |
| Another language (AUS) | 1 | 0 |
| Cantonese | 1 | 0 |
The O+S model across outcomes — math is actually the weakest fit (origin ability loads on math, selection loads more on reading/science):
md <- list(Math=lm(mathIQ~oNIQ+corr_logit,cm4,weights=n), Reading=lm(readIQ~oNIQ+corr_logit,cm4,weights=n),
Science=lm(scieIQ~oNIQ+corr_logit,cm4,weights=n), Composite=lm(compIQ~oNIQ+corr_logit,cm4,weights=n))
modelsummary(md, output="flextable", coef_map=c(`(Intercept)`="Intercept", oNIQ="Origin NIQ (IQ/IQ)", corr_logit="Selection (IQ/logit)"),
gof_map=tibble::tribble(~raw,~clean,~fmt,"adj.r.squared","Adj. R²",3), stars=c(`*`=.05,`**`=.01,`***`=.001),
notes="O+S model by outcome domain. n = 97 corridors.") |> ft_style()
| Math | Reading | Science | Composite |
|---|---|---|---|---|
Intercept | 59.919*** | 70.402*** | 62.547*** | 64.317*** |
(5.262) | (4.901) | (5.075) | (4.960) | |
Origin NIQ (IQ/IQ) | 0.473*** | 0.358*** | 0.452*** | 0.427*** |
(0.059) | (0.055) | (0.057) | (0.055) | |
Selection (IQ/logit) | 4.583*** | 5.333*** | 5.651*** | 5.189*** |
(0.381) | (0.355) | (0.368) | (0.359) | |
Adj. R² | 0.649 | 0.713 | 0.731 | 0.708 |
* p < 0.05, ** p < 0.01, *** p < 0.001 | ||||
O+S model by outcome domain. n = 97 corridors. | ||||
The three “domains” correlate ~0.89 within a single country (German natives), i.e. they are effectively one scholastic g factor (cf. Pokropek et al. 2022):
knitr::kable(cor_DEU, caption="PISA domain correlations, German natives (mean r = 0.89)")
| math | read | scie | |
|---|---|---|---|
| math | 1.000 | 0.854 | 0.923 |
| read | 0.854 | 1.000 | 0.891 |
| scie | 0.923 | 0.891 | 1.000 |