Migrant performance on PISA worldwide

We treat the world as a migration matrix of performance: rows = origin countries, columns = destination countries, each cell = mean PISA math of that origin’s children living in that destination (pooled cycles 2003–2022, students whose two parents share an origin). PISA scores are also shown on the IQ scale (500 → 100, SD 90, so 6 PISA points = 1 IQ point).

The matrix is mostly empty

Due to sample sizes, privacy suppression of origin country, unequal population sizes, and differential migration, only a tiny fraction of origin × destination pairs are populated.

allO <- sort(unique(coverage_counts$origin)); allD <- sort(unique(coverage_counts$dest))
grid <- expand_grid(origin=allO, destination=allD) |>
  left_join(coverage_counts, by=c("origin"="origin","destination"="dest")) |>
  mutate(filled = !is.na(n) & n>=25)
ordO <- grid |> group_by(origin) |> summarise(s=sum(filled)) |> arrange(desc(s)) |> pull(origin)
ordD <- grid |> group_by(destination) |> summarise(s=sum(filled)) |> arrange(desc(s)) |> pull(destination)
grid <- grid |> mutate(origin=factor(origin,levels=rev(ordO)), destination=factor(destination,levels=ordD))
pct <- round(100*sum(grid$filled)/nrow(grid),1)
ggplot(grid, aes(destination, origin, fill=filled)) + geom_tile() +
  scale_fill_manual(values=c(`FALSE`="firebrick3",`TRUE`="forestgreen"), labels=c("no data (n<25)","n ≥ 25"), name=NULL) +
  labs(title=sprintf("Origin × destination coverage: %d of %d cells (%.1f%%)", sum(grid$filled), nrow(grid), pct),
       x="Destination (most data → least)", y="Origin (most data → least)") +
  theme_minimal() + theme(axis.text=element_blank(), axis.ticks=element_blank(), panel.grid=element_blank(), legend.position="bottom")

Some example rows (PISA math by destination, in order of sample size):

ex_tab <- function(x) x |> mutate(IQ=toIQ(math)) |> rename(Destination=dest, N=n, Math=math)
knitr::kable(ex_tab(examples$Turkey), caption="Turks")

Turks
Destination	N	Math	IQ
AUT	1478	413	85.5
DEU	1286	427	87.8
DNK	1285	424	87.3
CHE	959	447	91.2
BEL	858	430	88.3
NLD	348	460	93.3
FIN	111	416	86.0
TUR	64	395	82.5
LIE	46	471	95.2

knitr::kable(ex_tab(examples$India), caption="Indians")

Indians
Destination	N	Math	IQ
AUS	1114	549	108.2
CAN	883	515	102.5
GBR	88	510	101.7
BRN	80	509	101.5
NZL	37	527	104.5

knitr::kable(ex_tab(examples$China), caption="Chinese")

Chinese
Destination	N	Math	IQ
MAC	17209	542	107.0
HKG	11797	549	108.2
AUS	1258	583	113.8
CAN	878	603	117.2
NZL	794	577	112.8
KAZ	116	433	88.8
FIN	68	488	98.0
PAN	60	401	83.5
NLD	42	519	103.2
PRT	41	507	101.2

Model 1: do migrants just carry their origin level?

The strict global-hereditarian model says migrants perform at their origin’s level wherever they go, so we can predict diaspora performance from origin national IQ. Countries that send migrants but don’t run PISA (red) get a home value imputed from NIQ.

od <- origin_level |> mutate(grp=ifelse(source=="PISA","has PISA home","NIQ-imputed"))
r <- round(cor(od$imm, od$NIQ, use="complete"),2)
ggplot(od, aes(NIQ, imm)) + geom_smooth(method="lm", se=FALSE, color="grey70") +
  geom_point(aes(size=N, color=grp), alpha=.7) +
  geom_text(aes(label=iso, color=grp), size=2.6, vjust=-0.8, check_overlap=TRUE, show.legend=FALSE) +
  scale_color_manual(values=c("has PISA home"="grey25","NIQ-imputed"="firebrick"), name=NULL) +
  labs(title=sprintf("Diaspora competence vs origin NIQ (r = %.2f)", r),
       x="Origin NIQ", y="Diaspora mean (PISA composite)", size="migrant N") + theme_minimal() + theme(legend.position="bottom")

India is the big outlier — Indians abroad far outperform Indians at home, hinting migrants aren’t representative.

Model 2: origin vs destination

score ~ origin_NIQ + destination_NIQ + year (standardized betas; both predictors on the IQ scale):

wsd <- function(x,w) sqrt(sum(w*(x-weighted.mean(x,w))^2)/sum(w)); z <- function(x,w) (x-weighted.mean(x,w))/wsd(x,w)
i3 <- imm3 |> mutate(zm=z(math,w), zo=z(oN,w), zd=z(dN,w))
ms <- list(`(1) Origin`=lm(zm~zo+factor(year),i3,weights=w), `(2) Destination`=lm(zm~zd+factor(year),i3,weights=w),
           `(3) Both`=lm(zm~zo+zd+factor(year),i3,weights=w))
modelsummary(ms, output="flextable", coef_map=c(zo="Origin NIQ (std β)", zd="Destination NIQ (std β)"),
   coef_omit="Inter|year", gof_map=tibble::tribble(~raw,~clean,~fmt,"adj.r.squared","Adj. R²",3),
   stars=c(`*`=.05,`**`=.01,`***`=.001), notes="Individual-level, W_FSTUWT-weighted, + year FE. n = 134,999.") |> ft_style()

	(1) Origin	(2) Destination	(3) Both
Origin NIQ (std β)	0.253***		0.088***
	(0.003)		(0.003)
Destination NIQ (std β)		0.480***	0.451***
		(0.003)	(0.003)
Adj. R²	0.067	0.221	0.227
* p < 0.05, p < 0.01, * p < 0.001
Individual-level, W_FSTUWT-weighted, + year FE. n = 134,999.

Destination NIQ is the far stronger predictor. The same holds when origin (dosage) and destination (dummies) are left free — comparing block partial R²:

oc <- oc_ref
R2d <- function(r) summary(lm(reformulate(r,"math"),dat3,weights=dat3$w))$r.squared
rf<-R2d(c(oc,"destination","factor(year)")); ro<-R2d(c(oc,"factor(year)")); rd<-R2d(c("destination","factor(year)")); ry<-R2d("factor(year)")
R2n <- function(r) summary(lm(reformulate(r,"math"),imm3,weights=imm3$w))$r.squared
nb<-R2n(c("oN","dN","factor(year)")); no<-R2n(c("oN","factor(year)")); nd<-R2n(c("dN","factor(year)"))
part <- data.frame(Model=c("Dosage + dummies","NIQ (single index)"),
  `Origin only`=c(ro,no), `Dest only`=c(rd,nd), Both=c(rf,nb),
  `Origin partial`=c(rf-rd,nb-nd), `Dest partial`=c(rf-ro,nb-no), check.names=FALSE)
flextable(part) |> colformat_double(j=2:6, digits=3) |> bold(j=c("Origin partial","Dest partial"), part="body") |> ft_style()

Model	Origin only	Dest only	Both	Origin partial	Dest partial
Dosage + dummies	0.228	0.279	0.320	0.041	0.091
NIQ (single index)	0.067	0.221	0.227	0.006	0.160

It’s migrant selection

Destination winning is explained by migrant selectivity. From the IAB Brain-Drain database we build a selection index = logit(high-educated % among emigrants) − logit(high-educated % of the origin population), and correlate it with the origin’s residual from NIQ:

r_no <- round(cov.wt(cbind(m$sel_logit,m$resid_noDest), wt=m$N, cor=TRUE)$cor[1,2],2)
r_w  <- round(cov.wt(cbind(m$sel_logit,m$resid_wDest),  wt=m$N, cor=TRUE)$cor[1,2],2)
ml <- m |> select(iso,N,sel_logit,resid_noDest,resid_wDest) |>
  pivot_longer(c(resid_noDest,resid_wDest), names_to="type", values_to="resid") |>
  mutate(panel=ifelse(type=="resid_noDest", sprintf("No destination control (r=%.2f)",r_no), sprintf("Destination-adjusted (r=%.2f)",r_w)))
ggplot(ml, aes(sel_logit, resid)) + geom_hline(yintercept=0,color="grey85") +
  geom_smooth(aes(weight=N), method="lm", se=FALSE, color="steelblue") +
  geom_point(aes(size=N), alpha=.6) + geom_text(aes(label=iso), size=2.4, vjust=-0.7, check_overlap=TRUE) +
  facet_wrap(~panel) + scale_size_area(max_size=9) +
  labs(title="Educational selection predicts the origin residual from NIQ",
       x="Educational selection (logit gap, IAB)", y="Origin residual from NIQ (PISA pts)", size="migrant N") +
  theme_minimal() + theme(legend.position="bottom")

A single model: origin NIQ + destination NIQ + corridor selection

Computing the selection index at the origin × destination level (it varies by corridor, hence not collinear with origin NIQ) lets us combine everything. Outcome on the IQ scale; all 7 subset models:

mods <- list(
  `O`=lm(mathIQ~oNIQ,cm4,weights=n), `D`=lm(mathIQ~dNIQ,cm4,weights=n), `S`=lm(mathIQ~corr_logit,cm4,weights=n),
  `O+D`=lm(mathIQ~oNIQ+dNIQ,cm4,weights=n), `O+S`=lm(mathIQ~oNIQ+corr_logit,cm4,weights=n),
  `D+S`=lm(mathIQ~dNIQ+corr_logit,cm4,weights=n), `O+D+S`=lm(mathIQ~oNIQ+dNIQ+corr_logit,cm4,weights=n))
modelsummary(mods, output="flextable",
   coef_map=c(`(Intercept)`="Intercept", oNIQ="Origin NIQ (O)", dNIQ="Destination NIQ (D)", corr_logit="Selection (S)"),
   gof_map=tibble::tribble(~raw,~clean,~fmt,"adj.r.squared","Adj. R²",3), stars=c(`*`=.05,`**`=.01,`***`=.001),
   notes=sprintf("Corridor-cell OLS, n-weighted, n = %d corridors (constant). O & D in IQ/IQ; S in IQ/logit. Predicted IQ = Intercept + O·oNIQ + D·dNIQ + S·selection.", nrow(cm4))) |> ft_style()

	O	D	S	O+D	O+S	D+S	O+D+S
Intercept	65.446***	-85.871*	101.867***	-111.473**	59.919***	-12.791	-35.500
	(8.306)	(41.194)	(0.977)	(38.845)	(5.262)	(34.175)	(26.253)
Origin NIQ (O)	0.341***			0.330***	0.473***		0.457***
	(0.091)			(0.083)	(0.059)		(0.055)
Destination NIQ (D)		1.855***		1.812***		1.162**	0.982***
		(0.419)		(0.390)		(0.346)	(0.265)
Selection (S)			4.008***		4.583***	3.580***	4.201***
			(0.484)		(0.381)	(0.477)	(0.372)
Adj. R²	0.118	0.162	0.413	0.275	0.649	0.470	0.691
* p < 0.05, p < 0.01, * p < 0.001
Corridor-cell OLS, n-weighted, n = 97 corridors (constant). O & D in IQ/IQ; S in IQ/logit. Predicted IQ = Intercept + O·oNIQ + D·dNIQ + S·selection.

O+S (origin NIQ modified by selection) reaches almost the full model. Its fit, corridor by corridor:

cm4$fit <- predict(mods[["O+S"]])
ggplot(cm4, aes(fit, mathIQ)) + geom_abline(slope=1,intercept=0,linetype=2,color="grey70") +
  geom_point(aes(size=n), alpha=.45, color="steelblue") +
  geom_text(aes(label=paste0(origin_iso,"→",destination_iso)), size=2.1, vjust=-0.6, check_overlap=TRUE) +
  scale_size_area(max_size=10) +
  labs(title="O+S model: fitted vs observed corridor IQ", x="Fitted IQ (origin NIQ + selection)", y="Observed corridor IQ", size="migrant N") +
  theme_minimal() + theme(legend.position="bottom")

Outliers: Americans, Filipinos, Malaysians

usd <- examples$USA |> mutate(IQ=toIQ(math)) |> rename(Destination=dest, N=n, Math=math)
knitr::kable(usd, caption="US-origin by destination (Mexico = returnees; inverse-variance weighting of the homogeneous low Mexico cell pulled the diaspora mean to ~440).")

US-origin by destination (Mexico = returnees; inverse-variance weighting of the homogeneous low Mexico cell pulled the diaspora mean to ~440).
Destination	N	Math	IQ
USA	742	462	93.7

PISA gives no ethnicity, but it gives home language. Filipinos abroad speak English at home far more than other Asian origins — but only ~7% do inside the Philippines, so it reflects selection + the Philippines’ English-medium schooling, not ethnicity:

knitr::kable(english_by_origin, caption="English at home (%) by origin, immigrants in the same destinations")

English at home (%) by origin, immigrants in the same destinations
origin	n	pct_English
Philippines	962	56
India	880	38
China	5002	4
Turkey	721	1

knitr::kable(phl_homeland_lang, caption="Home language inside the Philippines (PISA 2022) — English only 6.9%")

Home language inside the Philippines (PISA 2022) — English only 6.9%
lang	pct
Tagalog	42.8
Sinugbuanong Binisaya/Cebuano	11.9
Another language (PHL)	10.0
English	6.9
Iloko	5.7
Bikol	4.6
Hiligaynon	4.3
Waray	2.5

Malaysians are the opposite — an ethnic selection (Chinese/Indian minorities leaving under bumiputera policy): 39% speak Chinese at home (vs ~23% of Malaysia) and only 21% Malay (vs ~70%):

knitr::kable(mys_lang |> select(language=lang, n, pct), caption="Malaysian-origin home language")

Malaysian-origin home language
language	n	pct
Chinese	72	36
English	64	32
Malay	42	21
Another language (BRN)	14	7
Mandarin	5	2
Another language (AUS)	1	0
Cantonese	1	0

Math vs other domains, and dimensionality

The O+S model across outcomes — math is actually the weakest fit (origin ability loads on math, selection loads more on reading/science):

md <- list(Math=lm(mathIQ~oNIQ+corr_logit,cm4,weights=n), Reading=lm(readIQ~oNIQ+corr_logit,cm4,weights=n),
           Science=lm(scieIQ~oNIQ+corr_logit,cm4,weights=n), Composite=lm(compIQ~oNIQ+corr_logit,cm4,weights=n))
modelsummary(md, output="flextable", coef_map=c(`(Intercept)`="Intercept", oNIQ="Origin NIQ (IQ/IQ)", corr_logit="Selection (IQ/logit)"),
   gof_map=tibble::tribble(~raw,~clean,~fmt,"adj.r.squared","Adj. R²",3), stars=c(`*`=.05,`**`=.01,`***`=.001),
   notes="O+S model by outcome domain. n = 97 corridors.") |> ft_style()

	Math	Reading	Science	Composite
Intercept	59.919***	70.402***	62.547***	64.317***
	(5.262)	(4.901)	(5.075)	(4.960)
Origin NIQ (IQ/IQ)	0.473***	0.358***	0.452***	0.427***
	(0.059)	(0.055)	(0.057)	(0.055)
Selection (IQ/logit)	4.583***	5.333***	5.651***	5.189***
	(0.381)	(0.355)	(0.368)	(0.359)
Adj. R²	0.649	0.713	0.731	0.708
* p < 0.05, p < 0.01, * p < 0.001
O+S model by outcome domain. n = 97 corridors.

The three “domains” correlate ~0.89 within a single country (German natives), i.e. they are effectively one scholastic g factor (cf. Pokropek et al. 2022):

knitr::kable(cor_DEU, caption="PISA domain correlations, German natives (mean r = 0.89)")

PISA domain correlations, German natives (mean r = 0.89)
	math	read	scie
math	1.000	0.854	0.923
read	0.854	1.000	0.891
scie	0.923	0.891	1.000

Conclusions

PISA worldwide can be used to study origin-country effects, but coverage is sparse (~2% of origin × destination pairs reach n ≥ 25), so interpretation is limited.
Destination beats origin as a predictor of migrant PISA — but this is driven by migrant selectivity, which an independent education-based selection index recovers directly.
A model of origin NIQ + corridor selection (O+S) explains ~66–73% of variation across domains (~70% on average), almost matching the full model.
The three PISA domains are effectively a single g factor (mean r ≈ 0.89).