Init

library(kirkegaard)

## Loading required package: tidyverse

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4          ✔ readr     2.1.5     
## ✔ forcats   1.0.1          ✔ stringr   1.6.0     
## ✔ ggplot2   4.0.1.9000     ✔ tibble    3.3.0     
## ✔ lubridate 1.9.4          ✔ tidyr     1.3.1     
## ✔ purrr     1.2.0          
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Loading required package: magrittr
## 
## 
## Attaching package: 'magrittr'
## 
## 
## The following object is masked from 'package:purrr':
## 
##     set_names
## 
## 
## The following object is masked from 'package:tidyr':
## 
##     extract
## 
## 
## Loading required package: weights
## 
## Loading required package: assertthat
## 
## 
## Attaching package: 'assertthat'
## 
## 
## The following object is masked from 'package:tibble':
## 
##     has_name
## 
## 
## Loading required package: psych
## 
## 
## Attaching package: 'psych'
## 
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
## 
## 
## Loading required package: robustbase
## 
## 
## Attaching package: 'kirkegaard'
## 
## 
## The following object is masked from 'package:psych':
## 
##     rescale
## 
## 
## The following object is masked from 'package:assertthat':
## 
##     are_equal
## 
## 
## The following object is masked from 'package:purrr':
## 
##     is_logical
## 
## 
## The following object is masked from 'package:base':
## 
##     +

load_packages(
  tidyverse,
  glmnet,
  Matrix,
  pROC,
  knitr
)

## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Loaded glmnet 4.1-10
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

theme_set(theme_bw())

options(
    digits = 3
)

Data

Danish citizenship by naturalization requires an act of parliament listing every person granted citizenship. These laws (“Lov om indfødsrets meddelelse”) have been published 1-3 times per year since at least 1987.

We downloaded all 89 laws (1987-2025) via the retsinformation-api.dk REST API and parsed names, locations, and birth countries from the structured JSON. Note that the parsed data loaded here covers 85 distinct laws dated 1988-2025.

d <- read_rds("data/citizenship_grants.rds")
glimpse(d)

## Rows: 141,780
## Columns: 20
## $ law_year           <int> 1988, 1988, 1988, 1988, 1988, 1988, 1988, 1988, 198…
## $ law_number         <int> 157, 157, 157, 157, 157, 157, 157, 157, 157, 157, 1…
## $ law_short          <chr> "LOV nr 157 af 23/03/1988", "LOV nr 157 af 23/03/19…
## $ paragraph          <chr> "§ 1.", "§ 1.", "§ 1.", "§ 1.", "§ 1.", "§ 1.", "§ …
## $ section_type       <chr> "Direct grant", "Direct grant", "Direct grant", "Di…
## $ entry_number       <chr> "100)", "101)", "102)", "103)", "104)", "105)", "10…
## $ name               <chr> "Ilhan Caglar", "Ba Tuan Cao", "Raif Cengiz", "Emil…
## $ first_name         <chr> "Ilhan", "Ba", "Raif", "Emilia", "Marta", "Fatiha",…
## $ last_name          <chr> "Caglar", "Cao", "Cengiz", "Flores", "Lopez", "Azam…
## $ location           <chr> "Næstved", "Frederikshavn", "Greve", "København", "…
## $ birth_country      <chr> "Tyrkiet", "Vietnam", "Tyrkiet", "Chile", "El Salva…
## $ country_clean      <chr> "Tyrkiet", "Vietnam", "Tyrkiet", "Chile", "El Salva…
## $ region_birth       <chr> "MENAP/Turkey", "Non-Western other", "MENAP/Turkey"…
## $ region             <chr> "MENAP/Turkey", "Non-Western other", "MENAP/Turkey"…
## $ ethnicity2         <chr> "Muslim", "Vietnamese", "Muslim", "Hispanic", "Hisp…
## $ origin_pred        <chr> "Turkish", "Vietnamese", "Turkish", "Latin American…
## $ origin_conf        <dbl> 0.725, 0.688, 0.835, 0.867, 0.725, 0.941, 0.984, 0.…
## $ pred_region        <chr> "MENAP/Turkey", "Non-Western other", "MENAP/Turkey"…
## $ origin_simple      <chr> "Turkish", "Vietnamese", "Turkish", "Romance", "Rom…
## $ origin_simple_conf <dbl> 0.780, 0.780, 0.864, 0.846, 0.843, 0.987, 0.910, 0.…

cat("Total entries:", nrow(d), "\n")

## Total entries: 141780

cat("Laws:", n_distinct(paste(d$law_year, d$law_number)), "\n")

## Laws: 85

cat("Year range:", min(d$law_year), "-", max(d$law_year), "\n")

## Year range: 1988 - 2025

Grants per year

d |>
  count(law_year, name = "n_grants") |>
  ggplot(aes(x = law_year, y = n_grants)) +
  geom_col(fill = "#377EB8") +
  scale_x_continuous(breaks = seq(1988, 2025, 2)) +
  labs(title = "Danish citizenship grants per year", x = "Year", y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Name analysis

Top first names

first_names <- d |> count(first_name, sort = TRUE)
last_names <- d |> count(last_name, sort = TRUE)

first_names |>
  head(30) |>
  mutate(first_name = fct_reorder(first_name, n)) |>
  ggplot(aes(x = n, y = first_name)) +
  geom_col(fill = "#377EB8") +
  geom_text(aes(label = scales::comma(n)), hjust = -0.1, size = 3) +
  scale_x_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(title = "Top 30 first names in Danish citizenship grants (1988-2025)",
       x = "Count", y = NULL)

Merged spelling variants

muhammad_variants <- c("Mohammad", "Mohamed", "Mohammed", "Mohamad", "Mehmet",
                        "Muhammed", "Muhamed", "Mohamud", "Mohamoud", "Muhamad",
                        "Muhammet", "Muhammad", "Mohammud", "Mohamod", "Mohammod")

first_names_merged <- first_names |>
  rename(freq = n) |>
  mutate(first_name_merged = case_when(
    first_name %in% muhammad_variants ~ "Mohammad*",
    first_name %in% c("Ahmad", "Ahmed") ~ "Ahmad*",
    first_name %in% c("Fatima", "Fatma", "Fatime", "Fatimeh", "Fatimah") ~ "Fatima*",
    first_name %in% c("Hassan", "Hasan") ~ "Hassan*",
    first_name %in% c("Hussein", "Hussain", "Husein", "Husain") ~ "Hussein*",
    first_name %in% c("Mustafa", "Mustapha") ~ "Mustafa*",
    first_name %in% c("Omar", "Omer", "Ömer") ~ "Omar*",
    first_name %in% c("Sara", "Sarah") ~ "Sara*",
    first_name %in% c("Mariam", "Maryam", "Miriam", "Miryam") ~ "Mariam*",
    first_name %in% c("Ibrahim", "Ebrahim") ~ "Ibrahim*",
    first_name %in% c("Mahmoud", "Mahmud", "Mahmod") ~ "Mahmoud*",
    first_name %in% c("Khaled", "Khalid") ~ "Khaled*",
    first_name %in% c("Natalia", "Nataliya", "Natalya", "Natalie", "Nataliia", "Natalija") ~ "Natalia*",
    first_name %in% c("Elena", "Jelena", "Yelena") ~ "Elena*",
    TRUE ~ first_name
  )) |>
  summarise(count = sum(freq), .by = first_name_merged) |>
  arrange(desc(count))

first_names_merged |>
  head(30) |>
  mutate(first_name_merged = fct_reorder(first_name_merged, count)) |>
  ggplot(aes(x = count, y = first_name_merged)) +
  geom_col(fill = "#377EB8") +
  geom_text(aes(label = scales::comma(count)), hjust = -0.1, size = 3) +
  scale_x_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(title = "Top 30 first names (spelling variants merged)",
       subtitle = "* Mohammad/Mohamed/Mohammed/Mohamad/Mehmet/Muhammad etc.",
       x = "Count", y = NULL)

With its spelling variants merged, Mohammad* accounts for 3,971 entries, or 2.8% of all naturalizations.
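The share can be checked with a quick calculation. This is a minimal sketch that hard-codes the two counts reported in this document rather than recomputing them from `d`; the variable names are illustrative only.

```r
# Counts taken from the text and output above (not recomputed here).
n_mohammad <- 3971    # merged Mohammad* count
n_total    <- 141780  # total entries, as printed by nrow(d)

share_pct <- 100 * n_mohammad / n_total
round(share_pct, 1)
# 2.8
```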

Top last names

last_names |>
  head(30) |>
  mutate(last_name = fct_reorder(last_name, n)) |>
  ggplot(aes(x = n, y = last_name)) +
  geom_col(fill = "#E41A1C") +
  geom_text(aes(label = scales::comma(n)), hjust = -0.1, size = 3) +
  scale_x_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(title = "Top 30 last names in Danish citizenship grants (1988-2025)",
       x = "Count", y = NULL)

Region analysis

Official region data

The laws provide two sources of origin information, separated by a gap with no such data:

1988-1998: Birth country listed per person
2000-2020: No origin information in the law text
2021-2025: Paragraphs split by origin region (Nordic, Western, MENAP/stateless, Non-Western other)

region_by_year <- d |>
  filter(!is.na(region)) |>
  count(law_year, region) |>
  group_by(law_year) |>
  mutate(total = sum(n), pct = n / total * 100) |>
  ungroup() |>
  mutate(region = factor(region, levels = c("Nordic", "Western", "MENAP/Turkey", "Non-Western other")))

cols_region <- c("Nordic" = "#4DAF4A", "Western" = "#377EB8",
                  "MENAP/Turkey" = "#E41A1C", "Non-Western other" = "#FF7F00")

ggplot(region_by_year, aes(x = law_year, y = pct, fill = region)) +
  geom_col() +
  annotate("rect", xmin = 1999.5, xmax = 2020.5, ymin = 0, ymax = Inf,
           alpha = 0.15, fill = "grey50") +
  annotate("text", x = 2010, y = 50, label = "No region data\n(2000-2020)",
           size = 3.5, color = "grey30") +
  scale_fill_manual(values = cols_region) +
  scale_x_continuous(breaks = seq(1988, 2025, 2)) +
  scale_y_continuous(labels = \(x) paste0(x, "%")) +
  labs(title = "Danish citizenship grants by origin region (official data)",
       x = "Year", y = "Percentage", fill = "Region") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Note: the 2021-2025 region labels reflect the passport currently held, not ethnic origin. Many “Nordic” passport holders have non-Nordic names (for example, immigrants who first naturalized in Sweden or Norway).

Ethnicity classification

Model

We trained a character n-gram ridge regression classifier to predict ethnic origin from names:

Features: Character n-grams of length 1-5, minimum frequency 50 (4,885 features)
Training data: 29,353 names with known birth country (1988-1998 laws)
Method: Ridge regression (alpha=0), 10-fold CV for lambda
Evaluation: Nested 10x10 CV (unbiased)
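The feature step described above can be sketched roughly as follows. This is not the code used for the reported models: lower-casing, the lack of word-boundary padding, counting total (rather than per-name) n-gram frequency, and the helper names `char_ngrams`, `ngram_matrix` and `fit_ridge` are all assumptions made for illustration. `fit_ridge` is only defined, not run, and assumes glmnet is installed.

```r
library(Matrix)

# Split one name into all character n-grams of length n_min..n_max.
# Lower-casing and no boundary padding are assumptions of this sketch.
char_ngrams <- function(name, n_min = 1, n_max = 5) {
  s <- tolower(name)
  len <- nchar(s)
  out <- character(0)
  for (n in n_min:n_max) {
    if (len < n) next
    starts <- seq_len(len - n + 1)
    out <- c(out, substring(s, starts, starts + n - 1))
  }
  out
}

# Build a sparse (names x n-grams) count matrix, keeping only n-grams
# that occur at least min_freq times across all names.
ngram_matrix <- function(full_names, min_freq = 50) {
  grams  <- lapply(full_names, char_ngrams)
  row_id <- rep(seq_along(grams), lengths(grams))
  gram   <- unlist(grams)
  freq   <- table(gram)
  vocab  <- names(freq)[freq >= min_freq]
  keep   <- gram %in% vocab
  sparseMatrix(
    i = row_id[keep],
    j = match(gram[keep], vocab),
    x = rep(1, sum(keep)),  # repeated (i, j) pairs are summed into counts
    dims = c(length(full_names), length(vocab)),
    dimnames = list(NULL, vocab)
  )
}

# Hypothetical fitting helper: multinomial ridge (alpha = 0),
# with lambda chosen by 10-fold cross-validation.
fit_ridge <- function(x, y) {
  glmnet::cv.glmnet(x, y, family = "multinomial", alpha = 0, nfolds = 10)
}

# Tiny demonstration on two names from the data preview above.
x_demo <- ngram_matrix(c("Ilhan Caglar", "Raif Cengiz"), min_freq = 1)
dim(x_demo)
```

On the real training names, `min_freq = 50` would prune the vocabulary; the demonstration uses `min_freq = 1` only because two names cannot reach that threshold.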

21-group model

m21 <- tribble(
  ~Origin, ~N, ~Recall, ~Precision, ~F1, ~AUC,
  "Arab", 4808, 87.3, 81.2, 84.1, 0.978,
  "Iranian", 4652, 90.7, 87.2, 88.9, 0.990,
  "East European", 3502, 86.3, 84.9, 85.6, 0.982,
  "Turkish", 3337, 96.7, 91.6, 94.1, 0.997,
  "Vietnamese", 2235, 97.9, 95.5, 96.7, 0.998,
  "Sri Lankan", 1865, 94.2, 97.1, 95.6, 0.998,
  "North African", 1369, 68.8, 72.3, 70.5, 0.959,
  "Southeast Asian", 1091, 61.8, 62.6, 62.2, 0.957,
  "Pakistani", 1313, 78.6, 74.8, 76.7, 0.986,
  "Germanic", 1095, 71.9, 58.6, 64.6, 0.973,
  "Latin American", 771, 52.1, 61.5, 56.4, 0.967,
  "Anglo", 656, 51.2, 51.6, 51.4, 0.958,
  "East African", 459, 36.2, 64.2, 46.3, 0.935,
  "Nordic", 391, 40.0, 55.2, 46.4, 0.960,
  "Chinese", 387, 71.4, 75.0, 73.2, 0.988,
  "West African", 362, 25.0, 51.4, 33.6, 0.933,
  "South Asian", 332, 35.3, 57.1, 43.6, 0.888,
  "French/South European", 364, 27.7, 34.6, 30.8, 0.923,
  "Afghan", 146, 15.4, 57.1, 24.3, 0.950,
  "Somali/Arab", 122, 50.0, 61.9, 55.3, 0.943,
  "East Asian other", 96, 11.8, 33.3, 17.4, 0.964
)

kable(m21, caption = "21-group elastic net model (train/test split)")

21-group elastic net model (train/test split)
Origin	N	Recall	Precision	F1	AUC
Arab	4808	87.3	81.2	84.1	0.978
Iranian	4652	90.7	87.2	88.9	0.990
East European	3502	86.3	84.9	85.6	0.982
Turkish	3337	96.7	91.6	94.1	0.997
Vietnamese	2235	97.9	95.5	96.7	0.998
Sri Lankan	1865	94.2	97.1	95.6	0.998
North African	1369	68.8	72.3	70.5	0.959
Southeast Asian	1091	61.8	62.6	62.2	0.957
Pakistani	1313	78.6	74.8	76.7	0.986
Germanic	1095	71.9	58.6	64.6	0.973
Latin American	771	52.1	61.5	56.4	0.967
Anglo	656	51.2	51.6	51.4	0.958
East African	459	36.2	64.2	46.3	0.935
Nordic	391	40.0	55.2	46.4	0.960
Chinese	387	71.4	75.0	73.2	0.988
West African	362	25.0	51.4	33.6	0.933
South Asian	332	35.3	57.1	43.6	0.888
French/South European	364	27.7	34.6	30.8	0.923
Afghan	146	15.4	57.1	24.3	0.950
Somali/Arab	122	50.0	61.9	55.3	0.943
East Asian other	96	11.8	33.3	17.4	0.964

Overall: Accuracy 81.1%, Macro F1 61.8, Weighted F1 80.4, Macro AUC 0.963, Weighted AUC 0.980
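The F1 column follows from the Recall and Precision columns as their harmonic mean. A minimal sketch checking three rows copied from the 21-group table (the helper `f1_score` and the data frame `check` are illustrative names, not part of the original analysis):

```r
# F1 is the harmonic mean of precision and recall (all in percent here).
f1_score <- function(recall, precision) {
  2 * recall * precision / (recall + precision)
}

# Three rows copied from the 21-group table.
check <- data.frame(
  origin      = c("Arab", "West African", "Afghan"),
  recall      = c(87.3, 25.0, 15.4),
  precision   = c(81.2, 51.4, 57.1),
  f1_reported = c(84.1, 33.6, 24.3)
)
check$f1_computed <- round(f1_score(check$recall, check$precision), 1)
check
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two inputs, which is why groups with low recall (such as Afghan) score poorly despite moderate precision.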

10-group model (merged)

Groups merged for cleaner classification:

MENAP: Arab + Iranian + North African + Pakistani + Afghan + Somali/Arab
Germanic: Nordic + Germanic + Anglo
Romance: French/South European + Latin American
South Asian: Sri Lankan + South Asian
East Asian: Chinese + East Asian other
Sub-Saharan African: East African + West African
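The merge rules above amount to a lookup from the 21 labels to the 10 merged labels. A minimal sketch, assuming the merge is a plain relabelling of the training classes (the names `merge_map` and `to_simple` are hypothetical; the 10-group predictions in the data come from a separately fitted model, not from recoding the 21-group predictions):

```r
# Lookup from 21-group label to merged 10-group label.
# Labels not listed here (East European, Turkish, Vietnamese,
# Southeast Asian) are kept unchanged.
merge_map <- c(
  "Arab" = "MENAP", "Iranian" = "MENAP", "North African" = "MENAP",
  "Pakistani" = "MENAP", "Afghan" = "MENAP", "Somali/Arab" = "MENAP",
  "Nordic" = "Germanic", "Germanic" = "Germanic", "Anglo" = "Germanic",
  "French/South European" = "Romance", "Latin American" = "Romance",
  "Sri Lankan" = "South Asian", "South Asian" = "South Asian",
  "Chinese" = "East Asian", "East Asian other" = "East Asian",
  "East African" = "Sub-Saharan African",
  "West African" = "Sub-Saharan African"
)

to_simple <- function(origin) {
  merged <- unname(merge_map[origin])
  ifelse(is.na(merged), origin, merged)
}

to_simple(c("Arab", "Turkish", "Nordic"))
# "MENAP" "Turkish" "Germanic"
```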

m10 <- tribble(
  ~Origin, ~N, ~Recall, ~Precision, ~F1, ~AUC,
  "MENAP", 12410, 97.5, 92.7, 95.0, 0.991,
  "East European", 3502, 86.8, 85.8, 86.3, 0.986,
  "Turkish", 3337, 96.3, 93.2, 94.7, 0.997,
  "Vietnamese", 2235, 97.7, 96.8, 97.2, 0.998,
  "South Asian", 2197, 87.5, 95.8, 91.5, 0.985,
  "Germanic", 2142, 76.7, 70.4, 73.4, 0.973,
  "Romance", 1135, 57.7, 65.5, 61.4, 0.963,
  "Southeast Asian", 1091, 51.9, 61.8, 56.4, 0.958,
  "Sub-Saharan African", 821, 34.7, 67.1, 45.7, 0.920,
  "East Asian", 483, 65.6, 85.9, 74.4, 0.975
)

kable(m10, caption = "10-group ridge model (nested 10x10 CV, unbiased)")

10-group ridge model (nested 10x10 CV, unbiased)
Origin	N	Recall	Precision	F1	AUC
MENAP	12410	97.5	92.7	95.0	0.991
East European	3502	86.8	85.8	86.3	0.986
Turkish	3337	96.3	93.2	94.7	0.997
Vietnamese	2235	97.7	96.8	97.2	0.998
South Asian	2197	87.5	95.8	91.5	0.985
Germanic	2142	76.7	70.4	73.4	0.973
Romance	1135	57.7	65.5	61.4	0.963
Southeast Asian	1091	51.9	61.8	56.4	0.958
Sub-Saharan African	821	34.7	67.1	45.7	0.920
East Asian	483	65.6	85.9	74.4	0.975

Overall: Accuracy 88.3%, Macro F1 77.6, Weighted F1 87.8, Macro AUC 0.975, Weighted AUC 0.985
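The macro and weighted summaries can be reproduced from the per-group rows. This sketch assumes that "weighted" means weighting each group by its N; that assumption matches the reported 10-group figures to rounding, but it is not confirmed by the source. The data frame `tab10` simply restates the table above.

```r
# Per-group values copied from the 10-group table.
tab10 <- data.frame(
  origin = c("MENAP", "East European", "Turkish", "Vietnamese",
             "South Asian", "Germanic", "Romance", "Southeast Asian",
             "Sub-Saharan African", "East Asian"),
  n   = c(12410, 3502, 3337, 2235, 2197, 2142, 1135, 1091, 821, 483),
  f1  = c(95.0, 86.3, 94.7, 97.2, 91.5, 73.4, 61.4, 56.4, 45.7, 74.4),
  auc = c(0.991, 0.986, 0.997, 0.998, 0.985, 0.973, 0.963, 0.958,
          0.920, 0.975)
)

# Macro average: every group counts equally.
macro_f1  <- mean(tab10$f1)
macro_auc <- mean(tab10$auc)

# Weighted average: groups count in proportion to their size
# (assumption: weights are the N column).
weighted_f1  <- weighted.mean(tab10$f1, tab10$n)
weighted_auc <- weighted.mean(tab10$auc, tab10$n)

round(c(macro_f1 = macro_f1, weighted_f1 = weighted_f1), 1)
# macro_f1 77.6, weighted_f1 87.8
round(c(macro_auc = macro_auc, weighted_auc = weighted_auc), 3)
# macro_auc 0.975, weighted_auc 0.985
```

The gap between macro and weighted F1 reflects the class imbalance: the large MENAP group scores well and dominates the weighted figure, while small groups such as Sub-Saharan African pull the macro figure down.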

Predicted origin over time (10 groups)

origin_simple_year <- d |>
  count(law_year, origin_simple) |>
  group_by(law_year) |>
  mutate(total = sum(n), pct = n / total * 100) |>
  ungroup() |>
  mutate(origin_simple = factor(origin_simple, levels = c(
    "Germanic", "Romance", "East European",
    "Turkish", "MENAP",
    "South Asian", "Vietnamese", "East Asian", "Southeast Asian",
    "Sub-Saharan African"
  )))

simple_cols <- c(
  "Germanic" = "#1b9e77", "Romance" = "#a6cee3", "East European" = "#e6ab02",
  "Turkish" = "#d95f02", "MENAP" = "#e41a1c",
  "South Asian" = "#984ea3", "Vietnamese" = "#00bfc4",
  "East Asian" = "#377eb8", "Southeast Asian" = "#80b1d3",
  "Sub-Saharan African" = "#ff7f00"
)

ggplot(origin_simple_year, aes(x = law_year, y = n, fill = origin_simple)) +
  geom_col() +
  scale_fill_manual(values = simple_cols) +
  scale_x_continuous(breaks = seq(1988, 2025, 2)) +
  labs(title = "Danish citizenship grants by predicted ethnic origin (1988-2025)",
       subtitle = "10 groups, ridge classifier on character n-grams (88% accuracy)",
       x = "Year", y = "Number of grants", fill = "Origin") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplot(origin_simple_year, aes(x = law_year, y = pct, fill = origin_simple)) +
  geom_col() +
  scale_fill_manual(values = simple_cols) +
  scale_x_continuous(breaks = seq(1988, 2025, 2)) +
  scale_y_continuous(labels = \(x) paste0(x, "%")) +
  labs(title = "Danish citizenship grants by predicted ethnic origin (1988-2025)",
       subtitle = "10 groups, ridge classifier on character n-grams (88% accuracy)",
       x = "Year", y = "Percentage", fill = "Origin") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Predicted origin over time (21 groups)

origin_by_year <- d |>
  count(law_year, origin_pred) |>
  group_by(law_year) |>
  mutate(total = sum(n), pct = n / total * 100) |>
  ungroup() |>
  mutate(origin_pred = factor(origin_pred, levels = c(
    "Nordic", "Germanic", "Anglo", "French/South European",
    "East European", "Latin American",
    "Turkish", "Arab", "Iranian", "North African", "Somali/Arab", "Afghan",
    "Pakistani", "Sri Lankan", "South Asian",
    "Vietnamese", "Chinese", "Southeast Asian", "East Asian other",
    "East African", "West African"
  )))

origin_cols <- c(
  "Nordic" = "#1b9e77", "Germanic" = "#66a61e", "Anglo" = "#7570b3",
  "French/South European" = "#a6cee3", "East European" = "#e6ab02",
  "Latin American" = "#e7298a",
  "Turkish" = "#d95f02", "Arab" = "#e41a1c", "Iranian" = "#fb8072",
  "North African" = "#fdb462", "Somali/Arab" = "#bc80bd", "Afghan" = "#ccebc5",
  "Pakistani" = "#984ea3", "Sri Lankan" = "#8dd3c7", "South Asian" = "#bebada",
  "Vietnamese" = "#00bfc4", "Chinese" = "#377eb8", "Southeast Asian" = "#80b1d3",
  "East Asian other" = "#f781bf",
  "East African" = "#ff7f00", "West African" = "#b15928"
)

ggplot(origin_by_year, aes(x = law_year, y = pct, fill = origin_pred)) +
  geom_col() +
  scale_fill_manual(values = origin_cols) +
  scale_x_continuous(breaks = seq(1988, 2025, 2)) +
  scale_y_continuous(labels = \(x) paste0(x, "%")) +
  labs(title = "Danish citizenship grants by predicted name origin (21 groups)",
       x = "Year", y = "Percentage", fill = "Origin") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Validation against DST

We validated the predictions against Statistics Denmark (DST) table DKSTAT, which records citizenships granted by previous nationality.

knitr::include_graphics("figs/dst_validation.png")

Overall correlation of percentage shares: r = 0.82. Best validated groups: Vietnamese (r=0.98), Sri Lankan (r=0.97), Germanic (r=0.96), Anglo (r=0.96).

Model summary

knitr::include_graphics("figs/model_summary.png")

Confusion matrix (10 groups)

knitr::include_graphics("figs/confusion_matrix_10group.png")

MENAP names are classified with 98% recall. Of the Sub-Saharan African names, 34% are misclassified as MENAP because of overlapping Muslim naming traditions. The European groups (Germanic, Romance, East European) show moderate confusion with one another, plausibly due to shared cultural origins.

Classifier confidence

The AUC measures how well the model's confidence scores separate true members of a group from everyone else. In the plots, blue marks true members of the group and red marks everyone else; the better the two distributions separate, the higher the AUC.
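For a single group, the AUC equals the probability that a randomly chosen true member receives a higher score than a randomly chosen non-member. A minimal base-R sketch using the rank (Mann-Whitney) formulation; the function name `auc_rank` and the toy scores are illustrative, and the reported values were presumably computed with pROC rather than with this code.

```r
# One-vs-rest AUC from scores via the rank-sum (Mann-Whitney) identity.
# score:     model confidence for the group, one value per name
# is_member: TRUE if the name truly belongs to the group
auc_rank <- function(score, is_member) {
  r <- rank(score)  # ties receive average ranks
  n_pos <- sum(is_member)
  n_neg <- sum(!is_member)
  (sum(r[is_member]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy scores: perfect separation, then a mixed case.
auc_rank(c(0.9, 0.8, 0.3, 0.2), c(TRUE, TRUE, FALSE, FALSE))
# 1
auc_rank(c(0.9, 0.2, 0.8, 0.3), c(TRUE, TRUE, FALSE, FALSE))
# 0.5
```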

10-group model

knitr::include_graphics("figs/confidence_auc_10group.png")

21-group model

knitr::include_graphics("figs/confidence_auc_21group.png")

Confidence distribution (10-group model)

knitr::include_graphics("figs/confidence_ridge.png")

Pooled distribution

knitr::include_graphics("figs/origin_simple_pooled.png")

Just over half of all naturalizations (MENAP 43% + Turkish 10% = 53%) involve people with Muslim-origin names.

France comparison

France also publishes naturalization decrees in the Journal Officiel, listing every person with their name, birth date, and birth country. We extracted one decree (June 2025, n=631) and applied the same classifier.

knitr::include_graphics("figs/france_vs_denmark_2025.png")

MENAP names dominate France even more (65% vs 43%), reflecting the Maghreb colonial connection. Denmark has more Turkish (guest worker legacy) and East European (EU expansion) naturalizations. France has more Sub-Saharan African (francophone Africa). Note that birth country can be misleading for France-born children of immigrants – the name-based prediction captures ethnic origin regardless of birthplace.

Who gets Danish citizenship?

06 April, 2026
