Overview

Brazil publishes large-scale assessment microdata via INEP (the federal education research agency). The main tests are:

Test	Years used	Approx age	Sampling	Subjects
ENEM	2016–2024	~17–18	Voluntary, 3–5M/year	CN/CH/LC/MT + redação
SAEB	2017, 2019, 2021, 2023	~10, ~14, ~17	Census-based	LP, MT
ENADE	2017–2019, 2021–2023	~22–24 (undergrads)	All graduating in selected courses	FG + CE (3-year cycle)
ANA	2014	~8	Census	Reading, math
ENCCEJA	2017–2024	15+ / 18+	Adult equivalency	Various

Final muni- and state-level estimates use ENEM 2016–2024 only, inverse-variance weighted across years (subject “g” = 4-subject composite per muni-year).

Anchoring: Brazil = 84 IQ, within-race SD = 15. Conversion factor k = 15.35 IQ points per z-score unit. (See methodology section for caveats; ENEM is a selected sample, see §5.3.)
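As a rough illustration of the anchoring, the conversion can be sketched as below. This assumes z is expressed relative to the Brazilian population-weighted mean, so that z = 0 maps to IQ 84. The helper name `z_to_iq` is hypothetical and the actual scripts may implement this differently.

```r
# Hypothetical helper: convert a national z-score to the IQ scale used here.
# Assumption: z = 0 is the Brazilian population-weighted mean (anchored at 84)
# and one z unit corresponds to k = 15.35 IQ points.
z_to_iq <- function(z, anchor = 84, k = 15.35) {
  anchor + k * z
}

z_to_iq(c(-1, 0, 0.5))
```

Under the same assumption, a group gap in SD units converts to IQ points by multiplying by k, without the anchor.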

1. Race in Brazil

1.1 Census 2022 self-ID distribution

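library(tidyverse)  # provides read_csv(), tibble(), ggplot2 and forcats used below

# Municipality-level 2022 census race shares and total population
ma <- read_csv("data/municipios_censo2022_ancestry.csv", show_col_types = FALSE)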
shares <- with(ma, c(
  "Parda (brown / mixed)" = weighted.mean(pct_parda,    pop_total, na.rm=TRUE),
  "Branca (white)"        = weighted.mean(pct_branca,   pop_total, na.rm=TRUE),
  "Preta (Black)"         = weighted.mean(pct_preta,    pop_total, na.rm=TRUE),
  "Indígena (Indigenous)" = weighted.mean(pct_indigena, pop_total, na.rm=TRUE),
  "Amarela (East Asian)"  = weighted.mean(pct_amarela,  pop_total, na.rm=TRUE)
))
df_shares <- tibble(label = factor(names(shares), levels = names(shares)),
                    share = round(shares, 2))
race_cols <- c("Branca (white)" = "#3A7CBF",
               "Parda (brown / mixed)" = "#C4823F",
               "Preta (Black)" = "#3A3A3A",
               "Amarela (East Asian)" = "#D8A93A",
               "Indígena (Indigenous)" = "#3F8B4A")
ggplot(df_shares, aes(x = share, y = forcats::fct_rev(label), fill = label)) +
  geom_col(width = 0.7) +
  geom_text(aes(label = sprintf("%.1f%%", share)), hjust = -0.2, size = 4) +
  scale_x_continuous(limits = c(0, max(df_shares$share) * 1.18),
                     expand = expansion(mult = c(0, 0.05)),
                     labels = function(x) paste0(x, "%")) +
  scale_fill_manual(values = race_cols, guide = "none") +
  labs(title = "Brazilian self-identified race shares, 2022 census",
       subtitle = "Population-weighted across 5,570 municipalities; n ≈ 203 million",
       x = "Share of population", y = NULL)

1.2 Where do the groups live?

knitr::include_graphics(c("figs/map_share_white.png",
                          "figs/map_share_brown.png",
                          "figs/map_share_black.png",
                          "figs/map_share_yellow.png",
                          "figs/map_share_indigenous.png"))

Self-ID race shares per municipality, 2022 census.

2. Cognitive ability and SES per municipality

2.1 Municipal IQ from ENEM (2016–2024 pooled)

ENEM-only muni estimates: each (muni × year × subject = “g”) cell is z-scored against the national distribution, then inverse-variance weighted across years per muni. Anchor: Brazil pop-weighted mean = 84 IQ; within-race SD = 15.
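The pooling across years can be sketched as follows. This is a minimal illustration, assuming each muni-year cell comes with a z-score estimate and a standard error; the function name `ivw_mean` and the toy numbers are hypothetical, not taken from the pipeline.

```r
# Inverse-variance weighted mean of yearly estimates for one municipality.
# est: yearly z-score estimates; se: their standard errors (hypothetical inputs).
ivw_mean <- function(est, se) {
  ok <- is.finite(est) & is.finite(se) & se > 0   # drop unusable years
  w  <- 1 / se[ok]^2                              # weight = 1 / variance
  c(estimate = sum(w * est[ok]) / sum(w),
    se       = sqrt(1 / sum(w)))
}

# Toy example: three years; the noisiest year receives the least weight
ivw_mean(est = c(0.10, 0.30, 0.20), se = c(0.05, 0.20, 0.10))
```

Years with small cells have large standard errors, so they contribute little to the pooled muni estimate.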

knitr::include_graphics(c("figs/map_iq_municipality.png",
                          "figs/map_iq_state.png"))

ENEM-derived IQ, anchored to Brazil = 84 with within-race SD = 15.

2.2 Socioeconomic status (S factor)

S factor extracted from ENEM student questionnaire (parental education, household income, etc.), inverse-variance weighted across years per muni. Standardized to Brazil pop mean = 0, SD = 1.
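The standardization step can be illustrated with a short sketch. It assumes the reference distribution is the population-weighted distribution of muni scores, which the text does not spell out; `standardize_weighted` and the toy inputs are hypothetical.

```r
# Rescale muni-level scores so that the population-weighted mean is 0
# and the population-weighted SD is 1 (assumed reference distribution).
standardize_weighted <- function(x, pop) {
  ok <- is.finite(x) & is.finite(pop)
  m  <- weighted.mean(x[ok], pop[ok])
  s  <- sqrt(weighted.mean((x[ok] - m)^2, pop[ok]))
  (x - m) / s
}

raw_s <- c(-0.4, 0.1, 0.9, 1.5)     # toy raw factor scores
pop   <- c(5e3, 2e4, 3e5, 1.2e6)    # toy muni populations
s_std <- standardize_weighted(raw_s, pop)
weighted.mean(s_std, pop)           # ~0 by construction
```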

knitr::include_graphics(c("figs/map_S_muni_enem.png",
                          "figs/map_S_state_enem.png"))

ENEM-derived S factor, standardized.

2.3 Human Development Index (2010 IDHM)

The Atlas Brasil municipal HDI (IDHM) for 2010 is the most recent vintage available at the muni level (Atlas hasn’t released a post-2010 update yet).

knitr::include_graphics(c("figs/map_hdi_municipality.png",
                          "figs/map_hdi_state.png"))

HDI 2010 from UNDP/IPEA/FJP Atlas Brasil.

2.4 The IQ / S / HDI triad

knitr::include_graphics("figs/scatter_S_vs_HDI.png")

ENEM-derived S vs IDHM 2010.

S and HDI correlate r ≈ 0.89 at the muni level: they are alternative measurements of the same latent muni-developmental factor, with the HDI data being roughly 15 years out of date.

3. Branca share and outcomes

knitr::include_graphics(c("figs/scatter_white_vs_IQ.png",
                          "figs/scatter_white_vs_S.png",
                          "figs/scatter_white_vs_HDI.png"))

Muni Branca share vs three developmental indicators.

4. Correlations and ecological regressions

4.1 Muni-level correlation matrix

knitr::include_graphics("figs/cor_matrix_selfid_iq_S_HDI.png")

Lower triangle: pop-weighted; upper triangle: unweighted.

The IQ / S / HDI triad correlates 0.89–0.90 — at the muni level these are essentially three measurements of one latent developmental factor.
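A matrix with population-weighted correlations below the diagonal and unweighted ones above it can be assembled along these lines. The data here are simulated stand-ins (a shared latent factor plus noise), and using `cov.wt()` for the weighted part is an assumption about one reasonable way to do it, not a description of `correlation_matrix.R`.

```r
# Combine population-weighted (lower triangle) and unweighted (upper triangle)
# Pearson correlations in one matrix. `df` and `pop` are simulated stand-ins.
set.seed(1)
n   <- 200
f   <- rnorm(n)                               # shared latent factor
df  <- data.frame(IQ  = f + rnorm(n, sd = 0.4),
                  S   = f + rnorm(n, sd = 0.4),
                  HDI = f + rnorm(n, sd = 0.4))
pop <- rlnorm(n, meanlog = 10, sdlog = 1)     # skewed muni populations

r_unw <- cor(df)                                          # unweighted
r_wtd <- cov.wt(as.matrix(df), wt = pop, cor = TRUE)$cor  # population-weighted

combined <- r_unw
combined[lower.tri(combined)] <- r_wtd[lower.tri(r_wtd)]
round(combined, 2)
```

Because large munis dominate the weighted triangle, comparing the two triangles shows how much a correlation depends on a handful of big cities.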

4.2 OLS regressions (population-weighted)

knitr::include_graphics("figs/regression_table_iq_hdi_s.png")

OLS, race shares as proportions 0–1.
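For readers unfamiliar with the setup, a population-weighted ecological regression on race shares looks roughly like the sketch below. Everything in it is simulated: the column names, the coefficients used to generate the data, and the choice of Branca as the omitted reference category are illustrative assumptions, not the specification in `regressions_iq_hdi_s.R`.

```r
set.seed(2)
n <- 300
# Simulated race shares (proportions 0-1) that sum to 1 within each muni
raw    <- matrix(rexp(n * 3), ncol = 3)
shares <- raw / rowSums(raw)
d <- data.frame(branca = shares[, 1],
                parda  = shares[, 2],
                preta  = shares[, 3],
                pop    = rlnorm(n, 10, 1))
# Made-up data-generating process, for illustration only
d$iq <- 84 + 8 * d$branca - 4 * d$parda + rnorm(n, sd = 1)

# One share must be omitted because the shares sum to 1;
# here Branca serves as the reference category.
fit <- lm(iq ~ parda + preta, data = d, weights = pop)
coef(fit)
```

With shares coded 0–1, each coefficient is the expected difference between a muni composed entirely of that group and one composed entirely of the reference group.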

4.3 Spatial-error model (queen contiguity, λ ≈ 0.85)

Moran’s I on the OLS residuals is 0.49–0.53 (p ≈ 0), indicating strong spatial autocorrelation. The spatial-error model corrects for unobserved, spatially clustered regional drivers.
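To make the statistic concrete, here is a self-contained base-R sketch of Moran’s I under queen contiguity on a toy square lattice. The lattice, the helper names `queen_weights` and `morans_i`, and the simulated values are all hypothetical; real municipality polygons would normally be handled with a spatial package such as spdep (e.g. `moran.test`).

```r
# Row-standardized queen-contiguity weights for an nr x nc lattice:
# cells sharing an edge or a corner are neighbours.
queen_weights <- function(nr, nc) {
  n    <- nr * nc
  rows <- rep(seq_len(nr), times = nc)   # row index of each cell (column-major)
  cols <- rep(seq_len(nc), each = nr)    # column index of each cell
  W <- matrix(0, n, n)
  for (i in seq_len(n)) {
    nb <- abs(rows - rows[i]) <= 1 & abs(cols - cols[i]) <= 1
    nb[i] <- FALSE                       # a cell is not its own neighbour
    W[i, nb] <- 1 / sum(nb)              # each row sums to 1
  }
  W
}

# Moran's I = (n / sum(W)) * sum_ij W_ij z_i z_j / sum_i z_i^2
morans_i <- function(x, W) {
  z <- x - mean(x)
  (length(x) / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
}

set.seed(3)
W     <- queen_weights(10, 10)
trend <- as.vector(outer(1:10, 1:10, "+"))  # smooth spatial gradient
noise <- rnorm(100)
morans_i(trend + noise, W)  # clearly positive: neighbours resemble each other
morans_i(noise, W)          # close to zero: no spatial structure
```

Residuals that behave like the first case violate the independence assumption behind OLS standard errors, which is the motivation for a spatial-error specification.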

knitr::include_graphics("figs/regression_table_iq_hdi_s_SEM.png")

Spatial error models (errorsarlm), pop-weighted.

The spatial correction:

Removes the spurious positive Preta and Amarela coefficients on HDI/S that OLS produced (those were geographic artefacts — Preta concentrated in coastal urban Bahia/Rio, Amarela concentrated in greater São Paulo).
Strengthens the Parda and Indígena negative coefficients.
Leaves the IQ effect on HDI intact at +0.0128 HDI units per IQ point (in line with cross-country expectations).

Caveat: ecological models at muni level are brittle. The IQ / HDI redundancy means these regressions sit near a measurement-noise ceiling; “ancestry → IQ → HDI” mediation is not separable from “ancestry → general development factor” at this aggregation.

5. Race × subject means at the individual level

5.1 Distribution of self-ID by data source

knitr::include_graphics("figs/race_distribution_by_source_table.png")

Race shares by data source vs 2022 census. Last two columns show inflation ratio for the small categories.

Both ENEM and SAEB substantially over-report Amarela vs the census (5–7×). SAEB is worse than ENEM (~10–15% of SAEB students click randomly on the race question vs ~5–10% in ENEM). The over-reporting contaminates analyses of small-category groups.

5.2 ENEM race × subject (own sum-correct scoring)

knitr::include_graphics("figs/race_subject_zscores.png")

ENEM 2017–2023 sum-correct scores standardized to national mean = 0, SD = 1 per subject.

ENEM-aggregate gradient: Branca > Amarela > Parda ≈ Preta > Indígena. Note that the Amarela mean is ≈ 0 SD, i.e. at the national average, which is surprising given the worldwide East-Asian-vs-European pattern.
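Sum-correct scoring followed by per-subject standardization can be sketched as below. The response matrix and answer key are toy data, `score_sum_correct` is a hypothetical helper, and treating blank answers as incorrect is an assumption; the actual logic lives in `race_subject_scores_enem.R`.

```r
# Sum-correct scoring: count items answered correctly per student.
# responses: character matrix (students x items); key: correct option per item.
score_sum_correct <- function(responses, key) {
  correct <- t(t(responses) == key)   # compare each item column with its key
  rowSums(correct, na.rm = TRUE)      # assumption: blanks (NA) count as wrong
}

key <- c("A", "C", "B", "D")
responses <- rbind(c("A", "C", "B", "D"),
                   c("A", "B", "B", NA),
                   c("E", "C", "A", "D"),
                   c("B", "A", "C", "E"))

raw <- score_sum_correct(responses, key)   # 4, 2, 2, 0
z   <- (raw - mean(raw)) / sd(raw)         # standardize within the subject
z
```

In the real data the mean and SD would be taken over all test-takers nationally for each subject, so group means are expressed in national SD units.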

5.3 SAEB race × subject (mandatory test, sum-correct scoring)

knitr::include_graphics("figs/saeb_race_subject_zscores.png")

SAEB 2019/2021/2023 by grade.

5EF (~10yo) responses are noisy due to age-related misclick contamination. 3EM (~17yo) is the cleanest: it covers the same cohort as ENEM under mandatory administration. ENEM and SAEB give very similar gradients, which suggests ENEM’s selection bias is mild.

6. Resolving the Amarela paradox

If the national Amarela mean is dragged down by misclick contamination from non-Asian munis, restricting to munis with high real Amarela density should recover the East-Asian-vs-European cognitive advantage.

knitr::include_graphics("figs/amarela_vs_branca_by_density.png")

ENEM Math: Amarela vs Branca means by muni Amarela density.

The crossover at ~0.5% Amarela is the smoking gun:

Below 0.25%: Amarela self-IDers score −0.48 SD vs Branca on math (misclick floor; mostly low-attention students randomly picking the Amarela option)
1–2% density: Amarela +0.28 SD vs Branca
2–5% density (real Nikkei centres: Mogi das Cruzes, Maringá, Tomé-Açu): +0.59 SD vs Branca, ≈ 9 IQ points — matches the worldwide East Asian advantage on math.
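The contamination argument can be expressed as a simple two-component mixture. The sketch below is purely illustrative: the misclick rate, the "true" gap, and the misclicker gap are assumed values chosen to show the mechanism, not estimates from the data, and it further assumes that genuine Amarela students always report correctly.

```r
# Expected Amarela-minus-Branca gap among self-identified Amarela students
# when a fraction of other students pick "Amarela" at random.
# All parameter values are illustrative assumptions, not estimates.
observed_gap <- function(density, misclick_rate, true_gap, misclick_gap) {
  # share of self-identified Amarela who are genuinely Amarela
  p_true <- density / (density + (1 - density) * misclick_rate)
  p_true * true_gap + (1 - p_true) * misclick_gap
}

density <- c(0.001, 0.0025, 0.005, 0.01, 0.02, 0.05)   # true Amarela share
gap <- observed_gap(density,
                    misclick_rate = 0.006,   # assumed
                    true_gap      = 0.6,     # assumed, in SD units
                    misclick_gap  = -0.5)    # assumed, in SD units
data.frame(density_pct = 100 * density, expected_gap_sd = round(gap, 2))
```

Under these assumptions the expected gap rises monotonically with density and changes sign at low density, which is the qualitative shape the bucketed comparison is testing for.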

7. International comparison

7.1 State IQ vs admixture-project CA

knitr::include_graphics("figs/state_iq_vs_admixture.png")

Our ENEM-derived state IQ vs admixture-project CA estimates.

State-level r = 0.85 — our ENEM-derived state IQs match independent national-IQ-style estimates well.

7.2 Top southern Brazilian munis vs Southern European countries (HDI 2010)

top_south <- tribble(
  ~Place,                ~`IQ (our scale)`, ~`HDI 2010`,
  "Valinhos (SP)",                   96.6,        0.819,
  "Florianópolis (SC)",              93.8,        0.847,
  "São Caetano do Sul (SP)",         93.0,        0.862,
  "Lajeado (RS)",                    93.9,        0.778,
  "Curitiba (PR)",                   91.6,        0.823,
  "Porto Alegre (RS)",               91.1,        0.805,
  "—",                               NA,          NA,
  "Italy",                           97,          0.880,
  "Greece",                          93,          0.874,
  "Spain",                           96,          0.868,
  "Malta",                           95,          0.862,
  "Cyprus",                          91,          0.859,
  "Portugal",                        94,          0.831,
  "Brazil (national)",               84,          0.722
)
top_south

Top-tier Brazilian Branca-majority munis (~93–97 IQ, HDI 0.78–0.86 in 2010) sit at IQ-matched parity with Cyprus and Portugal on HDI. Mid-tier “rich southern” munis are 0.03–0.05 below their European peers in the 2010 vintage. Projecting forward (Brazil’s HDI grew by ~0.04 from 2010 to 2022), the highest Brazilian munis are essentially at Spain/Italy levels in the current vintage.

7.3 Skin color (LAPOP, state-level)

state_color <- read_csv("data_processed/state_lapop_color.csv", show_col_types = FALSE)
cor_LAPOP_IQ <- cor(state_color$LAPOP_color, state_color$IQ_enem, use = "pairwise.complete.obs")
state_color |> select(state, LAPOP_color, IQ_enem, HDI_latest, EUR_admx, AFR_admx) |>
  arrange(LAPOP_color)

State LAPOP skin-color (interviewer-rated, 0–10 palette, lighter = lower) correlates r = −0.66 with state IQ. LAPOP samples are too thin to support muni-level mediation analysis; a follow-up study pooling LAPOP data would be needed to resolve this.

8. Key takeaways

ENEM is the cleanest base for muni cognitive estimates: 9 years, 5,548 munis covered, internal cross-year r > 0.94. SAEB gives very similar gradients despite its mandatory administration, but with lower per-cell precision.
At the muni level, IQ / S / HDI are all measurements of one latent factor (cross-correlations 0.89–0.90). Race-mediation analyses should be reported with this redundancy disclosed.
Spatial autocorrelation matters — λ ≈ 0.85 in error model. Without spatial correction, OLS produces spurious positive Preta/Amarela coefficients via the urban-cluster geography.
Self-ID race is contaminated by random clicking: ~10–15% in SAEB, ~5–10% in ENEM. Small categories (Amarela, Indígena) are inflated 5–7× over census; their group means are heavily biased by the contamination.
The Amarela “paradox” resolves with muni-density bucketing. In real Nikkei communities, Asian-Brazilians outperform Brancas by +0.59 SD on math, matching the worldwide East Asian advantage. The national-aggregate “Amarela ≤ Branca” is a Simpson’s paradox driven by misclick contamination outside Nikkei zones.
Top-tier Brazilian Branca-heavy munis are essentially at Southern European HDI levels in an IQ-matched comparison.

Methodology / files

Data prep scripts live in scripts/:

Script	Output(s)
`enem.R`	per-year ENEM aggregates, panel, muni estimates
`saeb.R`	per-year SAEB aggregates
`enade.R`	per-year ENADE aggregates
`ana.R`	ANA 2014 aggregates
`encceja.R`	ENCCEJA aggregates
`prova_brasil.R`	Prova Brasil 2011 aggregates
`correlation_matrix.R`	cross-test/year correlation matrix
`idhm_validation.R`	HDI vs ENEM/SAEB validation
`race_gaps.R`	per-test race × sex × year aggregates
`score_distributions.R`	score density / NU_NOTA-bug audits
`enem_key_audit.R`, `enem_recover_keys.R`	answer-key recovery
`maps_iq.R`	IQ + HDI maps
`maps_selfid.R`	race share maps
`regressions_iq_hdi_s.R`	OLS + SEM regressions, saved tables
`race_distribution_by_source.R`	race-by-source distribution + click-rate
`race_subject_scores_enem.R`	ENEM race × subject sum-correct scoring
`race_subject_scores_saeb.R`	SAEB race × subject sum-correct scoring
`amarela_density_buckets.R`	Asian-vs-White by muni density

Run all from project root:

Rscript run_all.R   # orchestrates the pipeline (data → tables → figures)

The Rmd above (analysis.Rmd) renders the writeup from the prepped outputs.

Race, intelligence, and inequality in Brazil

Emil O. W. Kirkegaard

2026-05-07