We treat the world as a migration matrix of performance: rows = origin countries, columns = destination countries, each cell = mean PISA math of that origin’s children living in that destination (pooled cycles 2003–2022, students whose two parents share an origin). PISA scores are also shown on the IQ scale (500 → 100, SD 90, so 6 PISA points = 1 IQ point).

The matrix is mostly empty

Due to sample sizes, privacy suppression of origin country, unequal population sizes, and differential migration, only a tiny fraction of origin × destination pairs are populated.

allO <- sort(unique(coverage_counts$origin)); allD <- sort(unique(coverage_counts$dest))
grid <- expand_grid(origin=allO, destination=allD) |>
  left_join(coverage_counts, by=c("origin"="origin","destination"="dest")) |>
  mutate(filled = !is.na(n) & n>=25)
ordO <- grid |> group_by(origin) |> summarise(s=sum(filled)) |> arrange(desc(s)) |> pull(origin)
ordD <- grid |> group_by(destination) |> summarise(s=sum(filled)) |> arrange(desc(s)) |> pull(destination)
grid <- grid |> mutate(origin=factor(origin,levels=rev(ordO)), destination=factor(destination,levels=ordD))
pct <- round(100*sum(grid$filled)/nrow(grid),1)
ggplot(grid, aes(destination, origin, fill=filled)) + geom_tile() +
  scale_fill_manual(values=c(`FALSE`="firebrick3",`TRUE`="forestgreen"), labels=c("no data (n<25)","n ≥ 25"), name=NULL) +
  labs(title=sprintf("Origin × destination coverage: %d of %d cells (%.1f%%)", sum(grid$filled), nrow(grid), pct),
       x="Destination (most data → least)", y="Origin (most data → least)") +
  theme_minimal() + theme(axis.text=element_blank(), axis.ticks=element_blank(), panel.grid=element_blank(), legend.position="bottom")

Some example rows (PISA math by destination, in order of sample size):

ex_tab <- function(x) x |> mutate(IQ=toIQ(math)) |> rename(Destination=dest, N=n, Math=math)
knitr::kable(ex_tab(examples$Turkey), caption="Turks")
Turks
Destination N Math IQ
AUT 1478 413 85.5
DEU 1286 427 87.8
DNK 1285 424 87.3
CHE 959 447 91.2
BEL 858 430 88.3
NLD 348 460 93.3
FIN 111 416 86.0
TUR 64 395 82.5
LIE 46 471 95.2
knitr::kable(ex_tab(examples$India), caption="Indians")
Indians
Destination N Math IQ
AUS 1114 549 108.2
CAN 883 515 102.5
GBR 88 510 101.7
BRN 80 509 101.5
NZL 37 527 104.5
knitr::kable(ex_tab(examples$China), caption="Chinese")
Chinese
Destination N Math IQ
MAC 17209 542 107.0
HKG 11797 549 108.2
AUS 1258 583 113.8
CAN 878 603 117.2
NZL 794 577 112.8
KAZ 116 433 88.8
FIN 68 488 98.0
PAN 60 401 83.5
NLD 42 519 103.2
PRT 41 507 101.2

Model 1: do migrants just carry their origin level?

The strict global-hereditarian model says migrants perform at their origin’s level wherever they go, so we can predict diaspora performance from origin national IQ. Countries that send migrants but don’t run PISA (red) get a home value imputed from NIQ.

od <- origin_level |> mutate(grp=ifelse(source=="PISA","has PISA home","NIQ-imputed"))
r <- round(cor(od$imm, od$NIQ, use="complete"),2)
ggplot(od, aes(NIQ, imm)) + geom_smooth(method="lm", se=FALSE, color="grey70") +
  geom_point(aes(size=N, color=grp), alpha=.7) +
  geom_text(aes(label=iso, color=grp), size=2.6, vjust=-0.8, check_overlap=TRUE, show.legend=FALSE) +
  scale_color_manual(values=c("has PISA home"="grey25","NIQ-imputed"="firebrick"), name=NULL) +
  labs(title=sprintf("Diaspora competence vs origin NIQ (r = %.2f)", r),
       x="Origin NIQ", y="Diaspora mean (PISA composite)", size="migrant N") + theme_minimal() + theme(legend.position="bottom")

India is the big outlier — Indians abroad far outperform Indians at home, hinting migrants aren’t representative.

Model 2: origin vs destination

score ~ origin_NIQ + destination_NIQ + year (standardized betas; both predictors on the IQ scale):

wsd <- function(x,w) sqrt(sum(w*(x-weighted.mean(x,w))^2)/sum(w)); z <- function(x,w) (x-weighted.mean(x,w))/wsd(x,w)
i3 <- imm3 |> mutate(zm=z(math,w), zo=z(oN,w), zd=z(dN,w))
ms <- list(`(1) Origin`=lm(zm~zo+factor(year),i3,weights=w), `(2) Destination`=lm(zm~zd+factor(year),i3,weights=w),
           `(3) Both`=lm(zm~zo+zd+factor(year),i3,weights=w))
modelsummary(ms, output="flextable", coef_map=c(zo="Origin NIQ (std β)", zd="Destination NIQ (std β)"),
   coef_omit="Inter|year", gof_map=tibble::tribble(~raw,~clean,~fmt,"adj.r.squared","Adj. R²",3),
   stars=c(`*`=.05,`**`=.01,`***`=.001), notes="Individual-level, W_FSTUWT-weighted, + year FE. n = 134,999.") |> ft_style()

(1) Origin

(2) Destination

(3) Both

Origin NIQ (std β)

0.253***

0.088***

(0.003)

(0.003)

Destination NIQ (std β)

0.480***

0.451***

(0.003)

(0.003)

Adj. R²

0.067

0.221

0.227

* p < 0.05, ** p < 0.01, *** p < 0.001

Individual-level, W_FSTUWT-weighted, + year FE. n = 134,999.

Destination NIQ is the far stronger predictor. The same holds when origin (dosage) and destination (dummies) are left free — comparing block partial R²:

oc <- oc_ref
R2d <- function(r) summary(lm(reformulate(r,"math"),dat3,weights=dat3$w))$r.squared
rf<-R2d(c(oc,"destination","factor(year)")); ro<-R2d(c(oc,"factor(year)")); rd<-R2d(c("destination","factor(year)")); ry<-R2d("factor(year)")
R2n <- function(r) summary(lm(reformulate(r,"math"),imm3,weights=imm3$w))$r.squared
nb<-R2n(c("oN","dN","factor(year)")); no<-R2n(c("oN","factor(year)")); nd<-R2n(c("dN","factor(year)"))
part <- data.frame(Model=c("Dosage + dummies","NIQ (single index)"),
  `Origin only`=c(ro,no), `Dest only`=c(rd,nd), Both=c(rf,nb),
  `Origin partial`=c(rf-rd,nb-nd), `Dest partial`=c(rf-ro,nb-no), check.names=FALSE)
flextable(part) |> colformat_double(j=2:6, digits=3) |> bold(j=c("Origin partial","Dest partial"), part="body") |> ft_style()

Model

Origin only

Dest only

Both

Origin partial

Dest partial

Dosage + dummies

0.228

0.279

0.320

0.041

0.091

NIQ (single index)

0.067

0.221

0.227

0.006

0.160

It’s migrant selection

Destination winning is explained by migrant selectivity. From the IAB Brain-Drain database we build a selection index = logit(high-educated % among emigrants) − logit(high-educated % of the origin population), and correlate it with the origin’s residual from NIQ:

r_no <- round(cov.wt(cbind(m$sel_logit,m$resid_noDest), wt=m$N, cor=TRUE)$cor[1,2],2)
r_w  <- round(cov.wt(cbind(m$sel_logit,m$resid_wDest),  wt=m$N, cor=TRUE)$cor[1,2],2)
ml <- m |> select(iso,N,sel_logit,resid_noDest,resid_wDest) |>
  pivot_longer(c(resid_noDest,resid_wDest), names_to="type", values_to="resid") |>
  mutate(panel=ifelse(type=="resid_noDest", sprintf("No destination control (r=%.2f)",r_no), sprintf("Destination-adjusted (r=%.2f)",r_w)))
ggplot(ml, aes(sel_logit, resid)) + geom_hline(yintercept=0,color="grey85") +
  geom_smooth(aes(weight=N), method="lm", se=FALSE, color="steelblue") +
  geom_point(aes(size=N), alpha=.6) + geom_text(aes(label=iso), size=2.4, vjust=-0.7, check_overlap=TRUE) +
  facet_wrap(~panel) + scale_size_area(max_size=9) +
  labs(title="Educational selection predicts the origin residual from NIQ",
       x="Educational selection (logit gap, IAB)", y="Origin residual from NIQ (PISA pts)", size="migrant N") +
  theme_minimal() + theme(legend.position="bottom")

A single model: origin NIQ + destination NIQ + corridor selection

Computing the selection index at the origin × destination level (it varies by corridor, hence not collinear with origin NIQ) lets us combine everything. Outcome on the IQ scale; all 7 subset models:

mods <- list(
  `O`=lm(mathIQ~oNIQ,cm4,weights=n), `D`=lm(mathIQ~dNIQ,cm4,weights=n), `S`=lm(mathIQ~corr_logit,cm4,weights=n),
  `O+D`=lm(mathIQ~oNIQ+dNIQ,cm4,weights=n), `O+S`=lm(mathIQ~oNIQ+corr_logit,cm4,weights=n),
  `D+S`=lm(mathIQ~dNIQ+corr_logit,cm4,weights=n), `O+D+S`=lm(mathIQ~oNIQ+dNIQ+corr_logit,cm4,weights=n))
modelsummary(mods, output="flextable",
   coef_map=c(`(Intercept)`="Intercept", oNIQ="Origin NIQ (O)", dNIQ="Destination NIQ (D)", corr_logit="Selection (S)"),
   gof_map=tibble::tribble(~raw,~clean,~fmt,"adj.r.squared","Adj. R²",3), stars=c(`*`=.05,`**`=.01,`***`=.001),
   notes=sprintf("Corridor-cell OLS, n-weighted, n = %d corridors (constant). O & D in IQ/IQ; S in IQ/logit. Predicted IQ = Intercept + O·oNIQ + D·dNIQ + S·selection.", nrow(cm4))) |> ft_style()

O

D

S

O+D

O+S

D+S

O+D+S

Intercept

65.446***

-85.871*

101.867***

-111.473**

59.919***

-12.791

-35.500

(8.306)

(41.194)

(0.977)

(38.845)

(5.262)

(34.175)

(26.253)

Origin NIQ (O)

0.341***

0.330***

0.473***

0.457***

(0.091)

(0.083)

(0.059)

(0.055)

Destination NIQ (D)

1.855***

1.812***

1.162**

0.982***

(0.419)

(0.390)

(0.346)

(0.265)

Selection (S)

4.008***

4.583***

3.580***

4.201***

(0.484)

(0.381)

(0.477)

(0.372)

Adj. R²

0.118

0.162

0.413

0.275

0.649

0.470

0.691

* p < 0.05, ** p < 0.01, *** p < 0.001

Corridor-cell OLS, n-weighted, n = 97 corridors (constant). O & D in IQ/IQ; S in IQ/logit. Predicted IQ = Intercept + O·oNIQ + D·dNIQ + S·selection.

O+S (origin NIQ modified by selection) reaches almost the full model. Its fit, corridor by corridor:

cm4$fit <- predict(mods[["O+S"]])
ggplot(cm4, aes(fit, mathIQ)) + geom_abline(slope=1,intercept=0,linetype=2,color="grey70") +
  geom_point(aes(size=n), alpha=.45, color="steelblue") +
  geom_text(aes(label=paste0(origin_iso,"→",destination_iso)), size=2.1, vjust=-0.6, check_overlap=TRUE) +
  scale_size_area(max_size=10) +
  labs(title="O+S model: fitted vs observed corridor IQ", x="Fitted IQ (origin NIQ + selection)", y="Observed corridor IQ", size="migrant N") +
  theme_minimal() + theme(legend.position="bottom")

Outliers: Americans, Filipinos, Malaysians

usd <- examples$USA |> mutate(IQ=toIQ(math)) |> rename(Destination=dest, N=n, Math=math)
knitr::kable(usd, caption="US-origin by destination (Mexico = returnees; inverse-variance weighting of the homogeneous low Mexico cell pulled the diaspora mean to ~440).")
US-origin by destination (Mexico = returnees; inverse-variance weighting of the homogeneous low Mexico cell pulled the diaspora mean to ~440).
Destination N Math IQ
USA 742 462 93.7

PISA gives no ethnicity, but it gives home language. Filipinos abroad speak English at home far more than other Asian origins — but only ~7% do inside the Philippines, so it reflects selection + the Philippines’ English-medium schooling, not ethnicity:

knitr::kable(english_by_origin, caption="English at home (%) by origin, immigrants in the same destinations")
English at home (%) by origin, immigrants in the same destinations
origin n pct_English
Philippines 962 56
India 880 38
China 5002 4
Turkey 721 1
knitr::kable(phl_homeland_lang, caption="Home language inside the Philippines (PISA 2022) — English only 6.9%")
Home language inside the Philippines (PISA 2022) — English only 6.9%
lang pct
Tagalog 42.8
Sinugbuanong Binisaya/Cebuano 11.9
Another language (PHL) 10.0
English 6.9
Iloko 5.7
Bikol 4.6
Hiligaynon 4.3
Waray 2.5

Malaysians are the opposite — an ethnic selection (Chinese/Indian minorities leaving under bumiputera policy): 39% speak Chinese at home (vs ~23% of Malaysia) and only 21% Malay (vs ~70%):

knitr::kable(mys_lang |> select(language=lang, n, pct), caption="Malaysian-origin home language")
Malaysian-origin home language
language n pct
Chinese 72 36
English 64 32
Malay 42 21
Another language (BRN) 14 7
Mandarin 5 2
Another language (AUS) 1 0
Cantonese 1 0

Math vs other domains, and dimensionality

The O+S model across outcomes — math is actually the weakest fit (origin ability loads on math, selection loads more on reading/science):

md <- list(Math=lm(mathIQ~oNIQ+corr_logit,cm4,weights=n), Reading=lm(readIQ~oNIQ+corr_logit,cm4,weights=n),
           Science=lm(scieIQ~oNIQ+corr_logit,cm4,weights=n), Composite=lm(compIQ~oNIQ+corr_logit,cm4,weights=n))
modelsummary(md, output="flextable", coef_map=c(`(Intercept)`="Intercept", oNIQ="Origin NIQ (IQ/IQ)", corr_logit="Selection (IQ/logit)"),
   gof_map=tibble::tribble(~raw,~clean,~fmt,"adj.r.squared","Adj. R²",3), stars=c(`*`=.05,`**`=.01,`***`=.001),
   notes="O+S model by outcome domain. n = 97 corridors.") |> ft_style()

Math

Reading

Science

Composite

Intercept

59.919***

70.402***

62.547***

64.317***

(5.262)

(4.901)

(5.075)

(4.960)

Origin NIQ (IQ/IQ)

0.473***

0.358***

0.452***

0.427***

(0.059)

(0.055)

(0.057)

(0.055)

Selection (IQ/logit)

4.583***

5.333***

5.651***

5.189***

(0.381)

(0.355)

(0.368)

(0.359)

Adj. R²

0.649

0.713

0.731

0.708

* p < 0.05, ** p < 0.01, *** p < 0.001

O+S model by outcome domain. n = 97 corridors.

The three “domains” correlate ~0.89 within a single country (German natives), i.e. they are effectively one scholastic g factor (cf. Pokropek et al. 2022):

knitr::kable(cor_DEU, caption="PISA domain correlations, German natives (mean r = 0.89)")
PISA domain correlations, German natives (mean r = 0.89)
math read scie
math 1.000 0.854 0.923
read 0.854 1.000 0.891
scie 0.923 0.891 1.000

Conclusions

  • PISA worldwide can be used to study origin-country effects, but coverage is sparse (~2% of origin × destination pairs reach n ≥ 25), so interpretation is limited.
  • Destination beats origin as a predictor of migrant PISA — but this is driven by migrant selectivity, which an independent education-based selection index recovers directly.
  • A model of origin NIQ + corridor selection (O+S) explains ~66–73% of variation across domains (~70% on average), almost matching the full model.
  • The three PISA domains are effectively a single g factor (mean r ≈ 0.89).