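Brazil publishes large-scale assessment microdata through INEP, the federal education research agency. The main tests are listed below; subject codes are CN (natural sciences), CH (human sciences), LC (languages), MT (mathematics), redação (essay), LP (Portuguese), FG (general education) and CE (course-specific component):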
| Test | Years used | Approx age | Sampling | Subjects |
|---|---|---|---|---|
| ENEM | 2016–2024 | ~17–18 | Voluntary, 3–5M/year | CN/CH/LC/MT + redação |
| SAEB | 2017, 2019, 2021, 2023 | ~10, ~14, ~17 | Census-based | LP, MT |
| ENADE | 2017–2019, 2021–2023 | ~22–24 (undergrads) | All graduating in selected courses | FG + CE (3-year cycle) |
| ANA | 2014 | ~8 | Census | Reading, math |
| ENCCEJA | 2017–2024 | 15+ / 18+ | Adult equivalency | Various |
Final muni- and state-level estimates use ENEM 2016–2024 only, inverse-variance weighted across years (subject “g” = 4-subject composite per muni-year).
Anchoring: Brazil = 84 IQ, within-race SD = 15. Conversion factor k = 15.35 IQ per z. (See methodology section for caveats; ENEM is a selected sample, see §3.)
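The anchoring amounts to a linear map from national z-scores to the IQ scale. A minimal sketch in R, assuming the z-scores are already centred on the population-weighted national mean; the function name `z_to_iq` and the example inputs are illustrative and do not come from the pipeline:

```r
# Linear rescaling from national z-score units to the anchored IQ scale.
# anchor = 84 and k = 15.35 are the values stated in the text; the function
# name and the example inputs are hypothetical.
z_to_iq <- function(z, anchor = 84, k = 15.35) {
  anchor + k * z
}

# Units at the national mean, half a z above it, and one z below it.
example_iq <- z_to_iq(c(0, 0.5, -1))
print(example_iq)
```

If z is expressed in national SD units, k = 15.35 means one national SD spans slightly more than the within-race SD of 15, which is what one would expect when between-group differences add to the total variance.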
library(tidyverse)  # readr, tibble and ggplot2; forcats is called via ::

# Municipality-level 2022 census table. The pct_* columns are treated as
# percentages (0-100), which matches the "%" axis and bar labels below.
ma <- read_csv("data/municipios_censo2022_ancestry.csv", show_col_types = FALSE)

# Population-weighted national share of each self-identified category.
shares <- with(ma, c(
  "Parda (brown / mixed)" = weighted.mean(pct_parda,    pop_total, na.rm = TRUE),
  "Branca (white)"        = weighted.mean(pct_branca,   pop_total, na.rm = TRUE),
  "Preta (Black)"         = weighted.mean(pct_preta,    pop_total, na.rm = TRUE),
  "Indígena (Indigenous)" = weighted.mean(pct_indigena, pop_total, na.rm = TRUE),
  "Amarela (East Asian)"  = weighted.mean(pct_amarela,  pop_total, na.rm = TRUE)
))
df_shares <- tibble(label = factor(names(shares), levels = names(shares)),
                    share = round(shares, 2))

# Named palette keyed by category label.
race_cols <- c("Branca (white)"        = "#3A7CBF",
               "Parda (brown / mixed)" = "#C4823F",
               "Preta (Black)"         = "#3A3A3A",
               "Amarela (East Asian)"  = "#D8A93A",
               "Indígena (Indigenous)" = "#3F8B4A")

# Horizontal bar chart; fct_rev() puts the first-listed category on top.
ggplot(df_shares, aes(x = share, y = forcats::fct_rev(label), fill = label)) +
  geom_col(width = 0.7) +
  geom_text(aes(label = sprintf("%.1f%%", share)), hjust = -0.2, size = 4) +
  scale_x_continuous(limits = c(0, max(df_shares$share) * 1.18),
                     expand = expansion(mult = c(0, 0.05)),
                     labels = function(x) paste0(x, "%")) +
  scale_fill_manual(values = race_cols, guide = "none") +
  labs(title = "Brazilian self-identified race shares, 2022 census",
       subtitle = "Population-weighted across 5,570 municipalities; n ≈ 203 million",
       x = "Share of population", y = NULL)
knitr::include_graphics(c("figs/map_share_white.png",
"figs/map_share_brown.png",
"figs/map_share_black.png",
"figs/map_share_yellow.png",
"figs/map_share_indigenous.png"))
Self-ID race shares per municipality, 2022 census.
ENEM-only municipality estimates: each (muni × year) cell of the composite subject “g” is z-scored against the national distribution, then inverse-variance weighted across years within each muni. Anchor: Brazil population-weighted mean = 84 IQ; within-race SD = 15.
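The cross-year pooling can be sketched as follows. This is a minimal illustration on an invented toy panel, assuming each muni-year cell comes with a sampling variance for its z-score (for example the squared standard error of the cell mean); the column names and the helper `ivw` are hypothetical:

```r
# Toy panel: one row per municipality-year with a composite z-score and its
# sampling variance (all values invented for illustration).
panel <- data.frame(
  muni = c("A", "A", "A", "B", "B"),
  year = c(2016, 2017, 2018, 2016, 2017),
  z    = c(0.20, 0.30, 0.10, -0.50, -0.30),
  var  = c(0.01, 0.04, 0.02, 0.05, 0.05)
)

# Inverse-variance weighted mean and its variance for one municipality.
ivw <- function(z, v) {
  w <- 1 / v
  c(z_ivw = sum(w * z) / sum(w), var_ivw = 1 / sum(w))
}

# Apply per municipality and bind the results into one data frame.
muni_est <- do.call(rbind, lapply(split(panel, panel$muni), function(d) {
  data.frame(muni = d$muni[1], t(ivw(d$z, d$var)))
}))

# Convert to the IQ scale with the anchor and k stated in the text.
muni_est$iq <- 84 + 15.35 * muni_est$z_ivw
print(muni_est)
```

Years with more test-takers have smaller variances and therefore pull the pooled estimate toward themselves, which is the point of weighting by precision rather than averaging years equally.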
knitr::include_graphics(c("figs/map_iq_municipality.png",
"figs/map_iq_state.png"))
ENEM-derived IQ, anchored to Brazil = 84 with within-race SD = 15.
S factor extracted from ENEM student questionnaire (parental education, household income, etc.), inverse-variance weighted across years per muni. Standardized to Brazil pop mean = 0, SD = 1.
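The text does not say which extraction method is used for S. As a hedged sketch, the block below takes the first principal component of standardized indicators as a stand-in for the factor and then rescales it to a population-weighted mean of 0 and SD of 1. The data are simulated and the column names are hypothetical:

```r
set.seed(1)
n <- 200

# Simulated municipality-level indicators driven by one latent factor
# (purely synthetic; column names are hypothetical).
latent <- rnorm(n)
ses <- data.frame(
  parent_educ  = latent + rnorm(n, sd = 0.5),
  hh_income    = latent + rnorm(n, sd = 0.5),
  has_internet = latent + rnorm(n, sd = 0.7)
)
pop <- round(runif(n, 2e3, 5e5))  # synthetic populations

# First principal component of the standardized indicators, used here as a
# stand-in for the S factor (an assumption, not the documented method).
pc1 <- prcomp(ses, center = TRUE, scale. = TRUE)$x[, 1]

# Orient the component so that higher S means higher parental education.
if (cor(pc1, ses$parent_educ) < 0) pc1 <- -pc1

# Standardize to population-weighted mean 0 and SD 1.
w   <- pop / sum(pop)
mu  <- sum(w * pc1)
sdw <- sqrt(sum(w * (pc1 - mu)^2))
S   <- (pc1 - mu) / sdw
```

The sign of a principal component is arbitrary, so the explicit orientation step matters whenever S is later correlated with other variables.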
knitr::include_graphics(c("figs/map_S_muni_enem.png",
"figs/map_S_state_enem.png"))
ENEM-derived S factor, standardized.
The Atlas Brasil municipal HDI (IDHM) for 2010 is the most recent vintage available at the muni level; Atlas has not yet released a post-2010 update.
knitr::include_graphics(c("figs/map_hdi_municipality.png",
"figs/map_hdi_state.png"))
HDI 2010 from UNDP/IPEA/FJP Atlas Brasil.
knitr::include_graphics("figs/scatter_S_vs_HDI.png")
ENEM-derived S vs IDHM 2010.
S and HDI correlate at r ≈ 0.89 at the muni level, which suggests they largely measure the same latent municipal development factor, with the HDI series about 15 years out of date.
knitr::include_graphics("figs/cor_matrix_selfid_iq_S_HDI.png")
Lower triangle: pop-weighted; upper triangle: unweighted.
The IQ / S / HDI triad correlates 0.89–0.90 — at the muni level these are essentially three measurements of one latent developmental factor.
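A matrix with population-weighted correlations in the lower triangle and unweighted ones in the upper triangle can be assembled in base R. The sketch below uses simulated data; `cov.wt()` is the base `stats` function for weighted covariance and correlation, while the variable names and the simulated values are illustrative only:

```r
set.seed(2)
n <- 300

# Three synthetic measures sharing one common factor.
dev <- rnorm(n)
x <- cbind(IQ  = dev + rnorm(n, sd = 0.3),
           S   = dev + rnorm(n, sd = 0.3),
           HDI = dev + rnorm(n, sd = 0.3))
pop <- round(rlnorm(n, meanlog = 10, sdlog = 1))  # synthetic populations

r_unw <- cor(x)                                # unweighted correlations
r_w   <- cov.wt(x, wt = pop, cor = TRUE)$cor   # population-weighted correlations

# Start from the unweighted matrix, then overwrite the lower triangle with
# the weighted values.
combined <- r_unw
combined[lower.tri(combined)] <- r_w[lower.tri(r_w)]
print(round(combined, 3))
```

Comparing the two triangles cell by cell shows how much a handful of very large municipalities drive each correlation.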
knitr::include_graphics("figs/regression_table_iq_hdi_s.png")
OLS, race shares as proportions 0–1.
Moran’s I on the OLS residuals is 0.49–0.53 (p ≈ 0), so the residuals are strongly spatially autocorrelated. A spatial error model is used to account for unobserved regional drivers.
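To make the statistic concrete, here is a base-R sketch of Moran's I on a toy geography. It only computes the point estimate from its textbook formula; the significance test reported in the text would normally come from a spatial package and is not reproduced here. The helper `morans_i` and the ten-unit line layout are illustrative assumptions:

```r
# Moran's I for a residual vector e and a spatial weights matrix W.
morans_i <- function(e, W) {
  e  <- e - mean(e)
  n  <- length(e)
  s0 <- sum(W)
  (n / s0) * as.numeric(t(e) %*% W %*% e) / sum(e^2)
}

# Toy geography: 10 units on a line, neighbours are the adjacent units,
# and each row of W is standardized to sum to 1.
n <- 10
W <- matrix(0, n, n)
for (i in seq_len(n)) {
  nb <- intersect(c(i - 1, i + 1), seq_len(n))
  W[i, nb] <- 1 / length(nb)
}

smooth_resid      <- seq_len(n)             # residuals trend across space
alternating_resid <- rep(c(1, -1), n / 2)   # neighbours always differ

i_smooth <- morans_i(smooth_resid, W)
i_alt    <- morans_i(alternating_resid, W)
print(c(smooth = i_smooth, alternating = i_alt))
```

Residuals that vary smoothly over space give a value near +1 and a checkerboard pattern gives a negative value, so values around 0.5 indicate substantial clustering that OLS standard errors ignore.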
knitr::include_graphics("figs/regression_table_iq_hdi_s_SEM.png")
Spatial error models (errorsarlm), pop-weighted.
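For reference, a spatial error model is conventionally written in the form below. The notation is the standard textbook one and is an assumption here, since the source does not define its symbols: W is the spatial weights matrix, λ the spatial autoregressive parameter of the error term, and X the regressors (race shares and any controls).

$$
y = X\beta + u, \qquad u = \lambda W u + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I)
$$

With λ = 0 the model reduces to ordinary least squares; values of λ close to 1 mean that a municipality's unexplained component is largely shared with its neighbours.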
The spatial correction:
Caveat: ecological models at muni level are brittle. The IQ / HDI redundancy means these regressions sit near a measurement-noise ceiling; “ancestry → IQ → HDI” mediation is not separable from “ancestry → general development factor” at this aggregation.
knitr::include_graphics("figs/race_distribution_by_source_table.png")
Race shares by data source vs 2022 census. Last two columns show inflation ratio for the small categories.
Both ENEM and SAEB substantially over-report Amarela relative to the census (5–7×). SAEB is worse than ENEM: an estimated ~10–15% of SAEB students click randomly on the race question, versus ~5–10% in ENEM. This over-reporting contaminates any analysis of the small categories.
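A simple mixture model shows why a modest random-click rate can inflate a small category several-fold. The sketch assumes that random clickers choose uniformly among the answer options, that there are five options, and that the true Amarela share is 0.4%; all three are illustrative assumptions rather than figures from the pipeline:

```r
# Observed share of a category when a fraction `click_rate` of respondents
# answer uniformly at random over `n_options` choices (an assumed model).
observed_share <- function(true_share, click_rate, n_options) {
  (1 - click_rate) * true_share + click_rate / n_options
}

true_amarela <- 0.004                 # hypothetical true share (0.4%)
click_rates  <- c(0.05, 0.10, 0.15)   # hypothetical random-click rates
n_options    <- 5                     # assumed number of answer options

obs       <- observed_share(true_amarela, click_rates, n_options)
inflation <- obs / true_amarela       # observed share relative to true share

# Fraction of self-reported Amarela responses that are random clicks.
misclick_fraction <- (click_rates / n_options) / obs

print(data.frame(click_rates, obs, inflation, misclick_fraction))
```

Under these assumptions the inflation ratio grows almost linearly with the click rate, and most responses recorded in the small category come from random clickers, which is why group means for that category are fragile.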
knitr::include_graphics("figs/race_subject_zscores.png")
ENEM 2017–2023 sum-correct scores standardized to national mean = 0, SD = 1 per subject.
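The per-subject standardization and group averaging can be sketched in base R as below. The records are invented and the column names are hypothetical; the real scripts work on full microdata and may standardize with population rather than sample formulas:

```r
# Toy records: subject, self-reported race, and number of correct answers
# (all values invented for illustration).
scores <- data.frame(
  subject   = rep(c("MT", "LC"), each = 6),
  race      = rep(c("Branca", "Parda", "Preta"), times = 4),
  n_correct = c(22, 18, 17, 25, 16, 15, 30, 27, 26, 31, 25, 24)
)

# z-score within each subject so every subject has mean 0 and SD 1.
scores$z <- ave(scores$n_correct, scores$subject,
                FUN = function(x) (x - mean(x)) / sd(x))

# Mean z per race and subject.
gaps <- aggregate(z ~ race + subject, data = scores, FUN = mean)
print(gaps)
```

Standardizing within subject before averaging keeps subjects with more items or wider score ranges from dominating the comparison.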
ENEM-aggregate gradient: Branca > Amarela > Parda ≈ Preta > Indígena. Note that the Amarela mean is ≈ 0, i.e. at the national average, which is unexpected given the East-Asian-vs-European pattern usually reported in international comparisons.
knitr::include_graphics("figs/saeb_race_subject_zscores.png")
SAEB 2019/2021/2023 by grade.
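Responses from 5EF (5th grade, ~10 years old) are noisy because misclicks on the race question are more common at that age. 3EM (3rd year of secondary school, ~17 years old) is the cleanest: it covers the same cohort as ENEM but under mandatory administration. ENEM and SAEB give very similar gradients, which suggests that ENEM’s selection bias is mild for these comparisons.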
If the national Amarela mean is dragged down by misclick contamination from non-Asian munis, restricting to munis with high real Amarela density should recover the East-Asian-vs-European cognitive advantage.
knitr::include_graphics("figs/amarela_vs_branca_by_density.png")
ENEM Math: Amarela vs Branca means by muni Amarela density.
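The bucketing itself is a grouped comparison. The sketch below shows only the mechanics, on invented records whose values were chosen by hand to display a sign change; they are not results. The cut points, column names and bucket labels are illustrative assumptions:

```r
# Toy student records (invented): self-reported race, standardized math
# score, and the census Amarela share (%) of the student's municipality.
students <- data.frame(
  race   = rep(c("Amarela", "Branca"), times = 6),
  math_z = c(-0.2, 0.3, -0.1, 0.2, 0.1, 0.3, 0.4, 0.3, 0.9, 0.4, 1.0, 0.3),
  muni_amarela_pct = rep(c(0.05, 0.2, 0.4, 0.8, 2.0, 5.0), each = 2)
)

# Bucket municipalities by Amarela density; the cut points are illustrative.
students$bucket <- cut(students$muni_amarela_pct,
                       breaks = c(0, 0.1, 0.5, 1, Inf),
                       labels = c("<0.1%", "0.1-0.5%", "0.5-1%", ">1%"))

# Mean score per race within each bucket, then the Amarela minus Branca gap.
means <- tapply(students$math_z, list(students$bucket, students$race), mean)
gap   <- means[, "Amarela"] - means[, "Branca"]
print(round(cbind(means, gap), 2))
```

Reading the `gap` column from the lowest to the highest density bucket is what locates a crossover, if there is one.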
The crossover at ~0.5% Amarela is the smoking gun:
knitr::include_graphics("figs/state_iq_vs_admixture.png")
Our ENEM-derived state IQ vs admixture-project CA estimates.
State-level r = 0.85 — our ENEM-derived state IQs match independent national-IQ-style estimates well.
top_south <- tribble(
~Place, ~`IQ (our scale)`, ~`HDI 2010`,
"Valinhos (SP)", 96.6, 0.819,
"Florianópolis (SC)", 93.8, 0.847,
"São Caetano do Sul (SP)", 93.0, 0.862,
"Lajeado (RS)", 93.9, 0.778,
"Curitiba (PR)", 91.6, 0.823,
"Porto Alegre (RS)", 91.1, 0.805,
"—", NA, NA,
"Italy", 97, 0.880,
"Greece", 93, 0.874,
"Spain", 96, 0.868,
"Malta", 95, 0.862,
"Cyprus", 91, 0.859,
"Portugal", 94, 0.831,
"Brazil (national)", 84, 0.722
)
top_south
Top-tier Brazilian Branca-majority munis (~93–97 IQ, HDI 0.78–0.86 in 2010) sit at IQ-matched parity with Cyprus and Portugal on HDI. Mid-tier “rich southern” munis are 0.03–0.05 below their European peers in the 2010 vintage. As a rough projection, if these munis tracked the national HDI gain of about +0.04 between 2010 and 2022, the highest of them would now be close to the Spain/Italy values shown in the table.
state_color <- read_csv("data_processed/state_lapop_color.csv", show_col_types = FALSE)
cor_LAPOP_IQ <- cor(state_color$LAPOP_color, state_color$IQ_enem, use = "pairwise.complete.obs")
state_color |> select(state, LAPOP_color, IQ_enem, HDI_latest, EUR_admx, AFR_admx) |>
arrange(LAPOP_color)
State LAPOP skin-color (interviewer-rated, 0–10 palette, lighter = lower) correlates r = -0.66 with state IQ. Sample size per state is too thin for muni-level mediation; a pooled-LAPOP follow-up study would resolve this.
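With so few units, the uncertainty around a state-level correlation is worth making explicit. The sketch below computes an approximate 95% interval with the Fisher z-transform. The value r = -0.66 is taken from the text, while n = 27 assumes one observation per federative unit (26 states plus the Federal District), which the source does not state; the helper `fisher_ci` is hypothetical:

```r
# Approximate confidence interval for a Pearson correlation via the
# Fisher z-transform.
fisher_ci <- function(r, n, level = 0.95) {
  z    <- atanh(r)                       # transform r to the z scale
  se   <- 1 / sqrt(n - 3)                # standard error on the z scale
  crit <- qnorm(1 - (1 - level) / 2)     # two-sided normal critical value
  tanh(z + c(-1, 1) * crit * se)         # back-transform the interval
}

# r from the text; n = 27 is an assumption (one row per federative unit).
ci <- fisher_ci(-0.66, 27)
print(round(ci, 2))
```

The interval is wide at this sample size, which is consistent with the note that the per-state samples are too thin for finer-grained mediation analysis.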
ENEM is the cleanest base for muni cognitive estimates — 9 years, 5,548 munis covered, internal cross-year r > 0.94. SAEB has selection-similar performance but lower per-cell precision.
At the muni level, IQ / S / HDI are all measurements of one latent factor (cross-correlations 0.89–0.90). Race-mediation analyses should be reported with this redundancy disclosed.
Spatial autocorrelation matters — λ ≈ 0.85 in error model. Without spatial correction, OLS produces spurious positive Preta/Amarela coefficients via the urban-cluster geography.
Self-ID race is contaminated by random clicking — 5–15% in SAEB, 5–10% in ENEM. Small categories (Amarela, Indígena) are inflated 5–7× over census; their group means are heavily biased by the contamination.
The Amarela “paradox” resolves with muni-density bucketing. In real Nikkei communities, Asian-Brazilians outperform Brancas by +0.59 SD on math, in line with the East Asian advantage reported internationally. The national-aggregate “Amarela ≤ Branca” result is a composition artifact, similar in spirit to Simpson’s paradox, driven by misclick contamination outside Nikkei zones.
Top-tier Brazilian Branca-heavy munis are essentially at Southern European HDI levels at IQ-matched comparison.
Data prep scripts live in scripts/:
| Script | Output(s) |
|---|---|
| enem.R | per-year ENEM aggregates, panel, municipality estimates |
| saeb.R | per-year SAEB aggregates |
| enade.R | per-year ENADE aggregates |
| ana.R | ANA 2014 aggregates |
| encceja.R | ENCCEJA aggregates |
| prova_brasil.R | Prova Brasil 2011 aggregates |
| correlation_matrix.R | cross-test/year correlation matrix |
| idhm_validation.R | HDI vs ENEM/SAEB validation |
| race_gaps.R | per-test race × sex × year aggregates |
| score_distributions.R | score density / NU_NOTA-bug audits |
| enem_key_audit.R, enem_recover_keys.R | answer-key recovery |
| maps_iq.R | IQ + HDI maps |
| maps_selfid.R | race share maps |
| regressions_iq_hdi_s.R | OLS + SEM regressions, saved tables |
| race_distribution_by_source.R | race-by-source distribution + click-rate |
| race_subject_scores_enem.R | ENEM race × subject sum-correct scoring |
| race_subject_scores_saeb.R | SAEB race × subject sum-correct scoring |
| amarela_density_buckets.R | Asian-vs-White by muni density |
Run all from project root:
Rscript run_all.R # orchestrates the pipeline (data → tables → figures)
The Rmd above (analysis.Rmd) renders the writeup from the prepped outputs.