Overview

Brazil publishes large-scale assessment microdata via INEP (the federal education research agency). The main tests are:

| Test | Years used | Approx age | Sampling | Subjects |
|---|---|---|---|---|
| ENEM | 2016–2024 | ~17–18 | Voluntary, 3–5M/year | CN/CH/LC/MT + redação |
| SAEB | 2017, 2019, 2021, 2023 | ~10, ~14, ~17 | Census-based | LP, MT |
| ENADE | 2017–2019, 2021–2023 | ~22–24 (undergrads) | All graduating in selected courses | FG + CE (3-year cycle) |
| ANA | 2014 | ~8 | Census | Reading, math |
| ENCCEJA | 2017–2024 | 15+ / 18+ | Adult equivalency | Various |

Final muni- and state-level estimates use ENEM 2016–2024 only, inverse-variance weighted across years (subject “g” = 4-subject composite per muni-year).
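
The cross-year pooling can be sketched as follows. This is a minimal illustration, assuming each muni-year cell supplies a composite z-score and a standard error; the function name `pool_ivw` and the toy numbers are hypothetical and are not taken from the pipeline.

```r
# Inverse-variance weighted pooling of yearly estimates for one muni.
# `est` = yearly composite z-scores, `se` = their standard errors (hypothetical inputs).
pool_ivw <- function(est, se) {
  ok <- is.finite(est) & is.finite(se) & se > 0   # drop unusable years
  w  <- 1 / se[ok]^2                              # weight = 1 / variance
  c(estimate = sum(w * est[ok]) / sum(w),
    se       = sqrt(1 / sum(w)))
}

# Toy example: three years, the noisiest year gets the least weight.
pooled <- pool_ivw(est = c(0.10, 0.30, 0.20), se = c(0.10, 0.20, 0.10))
print(round(pooled, 3))
```

Years with small cells (large standard errors) contribute little, which is the point of weighting by precision rather than averaging years equally.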

Anchoring: Brazil = 84 IQ, within-race SD = 15. Conversion factor k = 15.35 IQ per z. (See methodology section for caveats; ENEM is a selected sample, see §5.3.)
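
As a rough illustration of the anchoring, the conversion can be written as a linear map from z-scores to the IQ scale. This sketch assumes the mapping is simply linear, with z measured against the national population-weighted distribution; the function name `z_to_iq` is hypothetical.

```r
# Linear z-to-IQ conversion using the anchors stated above
# (assumption: IQ = anchor + k * z, with z relative to the national distribution).
z_to_iq <- function(z, anchor = 84, k = 15.35) anchor + k * z

iq <- z_to_iq(c(-1, 0, 0.5))   # z = 0 (the national mean) maps to 84
print(iq)
```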

1. Race in Brazil

1.1 Census 2022 self-ID distribution

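library(tidyverse)  # read_csv, tibble, ggplot2, forcats

# Population-weighted national share of each self-ID category
# (pct_* columns are assumed to be in percent, as the axis labels imply)
ma <- read_csv("data/municipios_censo2022_ancestry.csv", show_col_types = FALSE)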
shares <- with(ma, c(
  "Parda (brown / mixed)" = weighted.mean(pct_parda,    pop_total, na.rm=TRUE),
  "Branca (white)"        = weighted.mean(pct_branca,   pop_total, na.rm=TRUE),
  "Preta (Black)"         = weighted.mean(pct_preta,    pop_total, na.rm=TRUE),
  "Indígena (Indigenous)" = weighted.mean(pct_indigena, pop_total, na.rm=TRUE),
  "Amarela (East Asian)"  = weighted.mean(pct_amarela,  pop_total, na.rm=TRUE)
))
df_shares <- tibble(label = factor(names(shares), levels = names(shares)),
                    share = round(shares, 2))
race_cols <- c("Branca (white)" = "#3A7CBF",
               "Parda (brown / mixed)" = "#C4823F",
               "Preta (Black)" = "#3A3A3A",
               "Amarela (East Asian)" = "#D8A93A",
               "Indígena (Indigenous)" = "#3F8B4A")
ggplot(df_shares, aes(x = share, y = forcats::fct_rev(label), fill = label)) +
  geom_col(width = 0.7) +
  geom_text(aes(label = sprintf("%.1f%%", share)), hjust = -0.2, size = 4) +
  scale_x_continuous(limits = c(0, max(df_shares$share) * 1.18),
                     expand = expansion(mult = c(0, 0.05)),
                     labels = function(x) paste0(x, "%")) +
  scale_fill_manual(values = race_cols, guide = "none") +
  labs(title = "Brazilian self-identified race shares, 2022 census",
       subtitle = "Population-weighted across 5,570 municipalities; n ≈ 203 million",
       x = "Share of population", y = NULL)

1.2 Where do the groups live?

knitr::include_graphics(c("figs/map_share_white.png",
                          "figs/map_share_brown.png",
                          "figs/map_share_black.png",
                          "figs/map_share_yellow.png",
                          "figs/map_share_indigenous.png"))
Self-ID race shares per municipality, 2022 census.

2. Cognitive ability and SES per municipality

2.1 Municipal IQ from ENEM (2016–2024 pooled)

ENEM-only municipal estimates: each (muni × year) cell for subject “g” is z-scored against the national distribution, then inverse-variance weighted across years within each muni. Anchor: Brazil pop-weighted mean = 84 IQ; within-race SD = 15.

knitr::include_graphics(c("figs/map_iq_municipality.png",
                          "figs/map_iq_state.png"))
ENEM-derived IQ, anchored to Brazil = 84 with within-race SD = 15.

2.2 Socioeconomic status (S factor)

S factor extracted from ENEM student questionnaire (parental education, household income, etc.), inverse-variance weighted across years per muni. Standardized to Brazil pop mean = 0, SD = 1.
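
The text does not say how the S factor is extracted, so the following is only a sketch of one common approach: take the first principal component of standardized indicators, then rescale to a population-weighted mean of 0 and SD of 1. The indicator names, the simulated data, and the use of `prcomp` are all assumptions made for illustration.

```r
set.seed(1)
n <- 500                                   # hypothetical munis
latent <- rnorm(n)                         # simulated "true" SES
ind <- data.frame(                         # hypothetical questionnaire indicators
  parent_edu = 0.8 * latent + rnorm(n, sd = 0.6),
  income     = 0.7 * latent + rnorm(n, sd = 0.7),
  assets     = 0.6 * latent + rnorm(n, sd = 0.8)
)

# First principal component of the standardized indicators.
pc    <- prcomp(ind, center = TRUE, scale. = TRUE)
s_raw <- pc$x[, 1]
# The sign of a component is arbitrary: orient so higher S = more parental education.
if (cor(s_raw, ind$parent_edu) < 0) s_raw <- -s_raw

# Rescale to population-weighted mean 0, SD 1.
pop <- round(runif(n, 1e3, 1e5))           # hypothetical populations
mu  <- weighted.mean(s_raw, pop)
sdw <- sqrt(weighted.mean((s_raw - mu)^2, pop))
S   <- (s_raw - mu) / sdw
```

A factor-analytic extraction would give a similar ranking when the indicators load on one dominant dimension, as they do in this simulation.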

knitr::include_graphics(c("figs/map_S_muni_enem.png",
                          "figs/map_S_state_enem.png"))
ENEM-derived S factor, standardized.

2.3 Human Development Index (2010 IDHM)

The Atlas Brasil municipal HDI (IDHM) for 2010 is the most recent vintage available at the muni level; Atlas has not yet released a post-2010 update.

knitr::include_graphics(c("figs/map_hdi_municipality.png",
                          "figs/map_hdi_state.png"))
HDI 2010 from UNDP/IPEA/FJP Atlas Brasil.

2.4 The IQ / S / HDI triad

knitr::include_graphics("figs/scatter_S_vs_HDI.png")
ENEM-derived S vs IDHM 2010.

S and HDI correlate r ≈ 0.89 at the muni level, which suggests they are alternative measurements of the same latent muni-developmental factor; the HDI figures are about 15 years out of date.

3. Branca share and outcomes

knitr::include_graphics(c("figs/scatter_white_vs_IQ.png",
                          "figs/scatter_white_vs_S.png",
                          "figs/scatter_white_vs_HDI.png"))
Muni Branca share vs three developmental indicators.

4. Correlations and ecological regressions

4.1 Muni-level correlation matrix

knitr::include_graphics("figs/cor_matrix_selfid_iq_S_HDI.png")
Lower triangle: pop-weighted; upper triangle: unweighted.

The pairwise correlations among IQ, S and HDI are 0.89–0.90, so at the muni level they behave essentially as three measurements of one latent developmental factor.
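
For readers comparing the two triangles of the matrix, a population-weighted Pearson correlation can be computed as below. This is a sketch: `weighted_cor` is a hypothetical helper and the data are simulated, not the muni dataset.

```r
# Population-weighted Pearson correlation between two muni-level variables.
weighted_cor <- function(x, y, w) {
  ok <- complete.cases(x, y, w)
  x <- x[ok]; y <- y[ok]; w <- w[ok] / sum(w[ok])   # normalize weights
  mx <- sum(w * x); my <- sum(w * y)                # weighted means
  sum(w * (x - mx) * (y - my)) /
    sqrt(sum(w * (x - mx)^2) * sum(w * (y - my)^2))
}

set.seed(11)
n   <- 200
pop <- round(exp(rnorm(n, 9, 1.2)))        # hypothetical muni populations
s   <- rnorm(n)                            # simulated S factor
hdi <- 0.7 + 0.05 * s + rnorm(n, sd = 0.02)
print(c(unweighted = cor(s, hdi), pop_weighted = weighted_cor(s, hdi, pop)))
```

With equal weights the function reduces to the ordinary `cor()`, so differences between the triangles reflect how much large munis dominate the relationship.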

4.2 OLS regressions (population-weighted)

knitr::include_graphics("figs/regression_table_iq_hdi_s.png")
OLS, race shares as proportions 0–1.
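
A population-weighted OLS of this kind can be sketched with base R's `lm()` and its `weights` argument. The variable names, the simulated data, and the choice of Branca as the omitted reference category are assumptions for illustration, not the specification in regressions_iq_hdi_s.R.

```r
set.seed(42)
n   <- 2000
pop <- round(exp(rnorm(n, 10, 1)))         # hypothetical muni populations
# Hypothetical race shares as proportions (0-1); Branca is the omitted reference.
parda <- runif(n, 0.10, 0.70)
preta <- runif(n, 0.02, 0.20)
iq    <- 90 - 8 * parda - 5 * preta + rnorm(n, sd = 1)
d     <- data.frame(iq, parda, preta, pop)

# weights = pop makes each muni count in proportion to its population.
fit <- lm(iq ~ parda + preta, data = d, weights = pop)
print(round(coef(fit), 2))
```

Because shares are on a 0–1 scale, each coefficient is the expected change in the outcome when a share moves from 0 to 1 at the expense of the reference category.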

4.3 Spatial-error model (queen contiguity, λ ≈ 0.85)

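Moran’s I on the OLS residuals is 0.49–0.53 (p ≈ 0), i.e. strong positive spatial autocorrelation that OLS ignores. The spatial-error model absorbs unobserved, spatially clustered regional drivers into the error term.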
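
To make the diagnostic concrete, Moran's I with a row-standardized queen-contiguity matrix can be computed in base R on a toy lattice. This is a sketch only: the grid, the helper names `queen_weights` and `morans_i`, and the simulated vectors are hypothetical stand-ins for the real muni polygons and OLS residuals.

```r
# Row-standardized queen-contiguity weights for an nr x nc grid of cells.
queen_weights <- function(nr, nc) {
  rr <- rep(seq_len(nr), times = nc)       # row index of each cell
  cc <- rep(seq_len(nc), each = nr)        # column index of each cell
  # Queen neighbours: at most one step away in both row and column.
  W <- (abs(outer(rr, rr, "-")) <= 1 & abs(outer(cc, cc, "-")) <= 1) * 1
  diag(W) <- 0                             # a cell is not its own neighbour
  W / rowSums(W)
}

# Moran's I for values x and spatial weights matrix W.
morans_i <- function(x, W) {
  z <- x - mean(x)
  (length(x) / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
}

set.seed(7)
W      <- queen_weights(10, 10)
rr     <- rep(1:10, times = 10)
smooth <- rr + rnorm(100, sd = 0.5)        # regional gradient: spatially clustered
noise  <- rnorm(100)                       # no spatial structure
print(c(smooth = morans_i(smooth, W), noise = morans_i(noise, W)))
```

Values near 0 indicate no spatial structure; values like the 0.49–0.53 reported for the OLS residuals indicate that neighbouring munis share unexplained variation.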

knitr::include_graphics("figs/regression_table_iq_hdi_s_SEM.png")
Spatial error models (errorsarlm), pop-weighted.

The spatial correction:

  • Removes the spurious positive Preta and Amarela coefficients on HDI/S that OLS produced. Those were geographic artefacts: Preta is concentrated in coastal urban Bahia/Rio, Amarela in greater São Paulo.
  • Strengthens the negative Parda and Indígena coefficients.
  • Leaves the IQ effect on HDI at +0.0128 HDI points per IQ point (in line with cross-country expectations).

Caveat: ecological models at muni level are brittle. The IQ / HDI redundancy means these regressions sit near a measurement-noise ceiling; “ancestry → IQ → HDI” mediation is not separable from “ancestry → general development factor” at this aggregation.

5. Race × subject means at the individual level

5.1 Distribution of self-ID by data source

knitr::include_graphics("figs/race_distribution_by_source_table.png")
Race shares by data source vs 2022 census. Last two columns show inflation ratio for the small categories.

Amarela is over-represented 5–7× in both ENEM and SAEB relative to the census. SAEB is worse than ENEM: an estimated ~10–15% of SAEB students answer the race question at random, vs ~5–10% in ENEM. This over-reporting contaminates analyses of small-category groups.
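
One way to see how an inflation ratio relates to a random-click rate is a simple mixture model. The sketch below assumes that random clickers choose uniformly among the answer options and that everyone else reports as in the census; the function name, `n_options = 6`, and the example shares are hypothetical, and this is not necessarily how race_distribution_by_source.R estimates the click rate.

```r
# Mixture model: observed = (1 - c) * true + c / n_options, solved for c.
implied_click_rate <- function(observed, true, n_options = 6) {
  (observed - true) / (1 / n_options - true)
}

# Hypothetical category: 0.4% true share observed at 2.4% (6x inflation).
c_hat <- implied_click_rate(observed = 0.024, true = 0.004)
print(round(c_hat, 3))
```

Under these assumptions a small category is inflated many times over by a click rate that barely moves the shares of the large categories.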

5.2 ENEM race × subject (own sum-correct scoring)

knitr::include_graphics("figs/race_subject_zscores.png")
ENEM 2017–2023 sum-correct scores standardized to national mean = 0, SD = 1 per subject.

ENEM-aggregate gradient: Branca > Amarela > Parda ≈ Preta > Indígena. The Amarela mean is ≈ 0, which is surprising given the worldwide East-Asian-vs-European pattern (addressed in §6).
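
The standardization behind these group means can be sketched as follows: z-score the sum-correct score within each subject, then average by race × subject. The data are simulated and the column names and 45-item test length are hypothetical placeholders, not the fields used in race_subject_scores_enem.R.

```r
set.seed(3)
n <- 1000
d <- data.frame(
  race    = sample(c("Branca", "Parda", "Preta"), n, replace = TRUE),
  subject = sample(c("MT", "LC"), n, replace = TRUE),
  correct = as.numeric(rbinom(n, size = 45, prob = 0.4))  # sum-correct score
)

# Standardize within subject so each subject has mean 0 and SD 1.
d$z <- ave(d$correct, d$subject, FUN = function(x) (x - mean(x)) / sd(x))

# Race x subject table of mean z-scores.
gaps <- tapply(d$z, list(d$race, d$subject), mean)
print(round(gaps, 2))
```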

5.3 SAEB race × subject (mandatory test, sum-correct scoring)

knitr::include_graphics("figs/saeb_race_subject_zscores.png")
SAEB 2019/2021/2023 by grade.

5EF (5th year of ensino fundamental, ~10yo) responses are noisy because misclick contamination is worse at that age. 3EM (3rd year of ensino médio, ~17yo) is the cleanest: same cohort as ENEM, mandatory administration. ENEM and SAEB give very similar gradients, which suggests ENEM’s selection bias is mild.

6. Resolving the Amarela paradox

If the national Amarela mean is dragged down by misclick contamination from non-Asian munis, restricting to munis with high real Amarela density should recover the East-Asian-vs-European cognitive advantage.

knitr::include_graphics("figs/amarela_vs_branca_by_density.png")
ENEM Math: Amarela vs Branca means by muni Amarela density.

The crossover at ~0.5% Amarela is the smoking gun:

  • Below 0.25%: Amarela self-IDers score −0.48 SD vs Branca on math (misclick floor; mostly low-attention students randomly picking the Amarela option)
  • 1–2% density: Amarela +0.28 SD vs Branca
  • 2–5% density (real Nikkei centres: Mogi das Cruzes, Maringá, Tomé-Açu): +0.59 SD vs Branca, ≈ 9 IQ points — matches the worldwide East Asian advantage on math.
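
The bucket pattern is what a simple contamination mixture would produce. In the sketch below, self-ID Amarela is a blend of real Amarela students and random clickers, and the observed gap is their share-weighted average; every parameter value is hypothetical and chosen only to show the shape of the curve.

```r
# Observed Amarela-minus-Branca gap (in SD) when self-ID Amarela mixes real
# Amarela students with random clickers. All parameter values are hypothetical.
observed_gap <- function(true_share, click_share = 0.003,
                         gap_true = 0.6, gap_click = -0.5) {
  (true_share * gap_true + click_share * gap_click) / (true_share + click_share)
}

dens <- c(0.001, 0.005, 0.015, 0.035)      # true Amarela share of test-takers
g <- observed_gap(dens)
print(round(g, 2))

# Converting the top-bucket gap to IQ points with SD = 15:
print(0.59 * 15)                           # 8.85, i.e. about 9 IQ points
```

The gap is negative where clickers outnumber real Amarela students, crosses zero as density rises, and approaches the true gap in high-density munis.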

7. International comparison

7.1 State IQ vs admixture-project CA

knitr::include_graphics("figs/state_iq_vs_admixture.png")
Our ENEM-derived state IQ vs admixture-project CA estimates.

State-level r = 0.85: our ENEM-derived state IQs agree well with the admixture project’s independent, national-IQ-style cognitive ability (CA) estimates.

7.2 Top southern Brazilian munis vs Southern European countries (HDI 2010)

top_south <- tribble(
  ~Place,                ~`IQ (our scale)`, ~`HDI 2010`,
  "Valinhos (SP)",                   96.6,        0.819,
  "Florianópolis (SC)",              93.8,        0.847,
  "São Caetano do Sul (SP)",         93.0,        0.862,
  "Lajeado (RS)",                    93.9,        0.778,
  "Curitiba (PR)",                   91.6,        0.823,
  "Porto Alegre (RS)",               91.1,        0.805,
  "—",                               NA,          NA,
  "Italy",                           97,          0.880,
  "Greece",                          93,          0.874,
  "Spain",                           96,          0.868,
  "Malta",                           95,          0.862,
  "Cyprus",                          91,          0.859,
  "Portugal",                        94,          0.831,
  "Brazil (national)",               84,          0.722
)
top_south

Top-tier Brazilian Branca-majority munis (~93–97 IQ, HDI 0.78–0.86 in 2010) sit at roughly IQ-matched parity with Cyprus and Portugal on HDI. Mid-tier “rich southern” munis are 0.03–0.05 below their European peers in the 2010 vintage. Projecting forward at the national rate (Brazil’s HDI grew by about 0.04 from 2010 to 2022), the highest Brazilian munis would now sit around 0.89–0.90, at or above the 2010 values for Spain and Italy; this assumes the top munis gained as much as the country as a whole.

7.3 Skin color (LAPOP, state-level)

state_color <- read_csv("data_processed/state_lapop_color.csv", show_col_types = FALSE)
cor_LAPOP_IQ <- cor(state_color$LAPOP_color, state_color$IQ_enem, use = "pairwise.complete.obs")
state_color |> select(state, LAPOP_color, IQ_enem, HDI_latest, EUR_admx, AFR_admx) |>
  arrange(LAPOP_color)

State-level LAPOP skin color (interviewer-rated, 0–10 palette, lighter = lower) correlates r = −0.66 with state IQ. LAPOP samples are too thin below the state level for muni-level mediation analysis; a pooled-LAPOP follow-up study could address this.

8. Key takeaways

  1. ENEM is the cleanest base for muni cognitive estimates: 9 years, 5,548 munis covered, internal cross-year r > 0.94. SAEB shows very similar gradients, consistent with mild ENEM selection bias, but has lower per-cell precision.

  2. At the muni level, IQ / S / HDI are all measurements of one latent factor (cross-correlations 0.89–0.90). Race-mediation analyses should be reported with this redundancy disclosed.

  3. Spatial autocorrelation matters — λ ≈ 0.85 in error model. Without spatial correction, OLS produces spurious positive Preta/Amarela coefficients via the urban-cluster geography.

  4. Self-ID race is contaminated by random clicking: ~10–15% in SAEB, ~5–10% in ENEM. Small categories (Amarela, Indígena) are inflated 5–7× over census; their group means are heavily biased by the contamination.

  5. The Amarela “paradox” resolves with muni-density bucketing. In real Nikkei communities, Asian-Brazilians outperform Brancas by +0.59 SD on math, matching the worldwide East Asian advantage. The national-aggregate “Amarela ≤ Branca” is a composition artefact driven by misclick contamination outside Nikkei zones, not a true group difference.

  6. Top-tier Brazilian Branca-heavy munis are essentially at Southern European HDI levels at IQ-matched comparison.

Methodology / files

Data prep scripts live in scripts/:

| Script | Output(s) |
|---|---|
| enem.R | per-year ENEM aggregates, panel, municipal estimates |
| saeb.R | per-year SAEB aggregates |
| enade.R | per-year ENADE aggregates |
| ana.R | ANA 2014 aggregates |
| encceja.R | ENCCEJA aggregates |
| prova_brasil.R | Prova Brasil 2011 aggregates |
| correlation_matrix.R | cross-test/year correlation matrix |
| idhm_validation.R | HDI vs ENEM/SAEB validation |
| race_gaps.R | per-test race × sex × year aggregates |
| score_distributions.R | score density / NU_NOTA-bug audits |
| enem_key_audit.R, enem_recover_keys.R | answer-key recovery |
| maps_iq.R | IQ + HDI maps |
| maps_selfid.R | race share maps |
| regressions_iq_hdi_s.R | OLS + SEM regressions, saved tables |
| race_distribution_by_source.R | race-by-source distribution + click-rate |
| race_subject_scores_enem.R | ENEM race × subject sum-correct scoring |
| race_subject_scores_saeb.R | SAEB race × subject sum-correct scoring |
| amarela_density_buckets.R | Asian-vs-White by muni density |

Run all from project root:

Rscript run_all.R   # orchestrates the pipeline (data → tables → figures)

The Rmd above (analysis.Rmd) renders the writeup from the prepped outputs.