ALC word embeddings

Calculate embeddings for UNSC speeches on “human rights”

Here, I begin by identifying all contexts in the UNSC speeches that mention “human rights” (or variants thereof). I then compute à la carte (ALC) embeddings for these contexts, and average them to the country level. Finally, I inspect the nearest neighbors of each country’s “human rights” embedding to see how different countries discuss human rights in the UNSC.

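# Packages: quanteda (corpus, tokens, dfm), conText (tokens_context, dem, dem_group),
# tidyverse (dplyr, purrr, tibble, tidyr, stringr, ggplot2)
library(quanteda)
library(conText)
library(tidyverse)

setwd("~/Desktop/Aarhus_Y1/16_UnText/1_UN_data/security_1992:2023")
merged_df <- read.csv("merged_UNSC_meta.csv")
pre_trained   <- readRDS("glove.rds")        # pretrained GloVe embeddings (rows = words)
transform_mat <- readRDS("khodakA.rds")      # ALC transformation matrix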
# --- 1) Build corpus & tokens ---
corp <- corpus(merged_df, text_field = "content")
docvars(corp, "country_org") <- merged_df$country_org
docvars(corp, "year")        <- merged_df$year

toks <- tokens(
  corp,
  remove_punct = TRUE, remove_symbols = TRUE,
  remove_numbers = TRUE, remove_separators = TRUE
) |>
  tokens_tolower()

# --- 2) Extract contexts around "human rights" ---
# quanteda keeps hyphenated words as single tokens by default (split_hyphens = FALSE),
# so "human-rights" is one token and is NOT matched by this two-token pattern.
toks_hr <- tokens_context(
  x = toks,
  pattern = "human* right*",      # glob phrase: 'human rights', 'human rights-based', but also 'humanitarian rights'
  window = 6                      # ±6-word window (tune as needed)
)

## 18 instances of "human right" found.
## 1 instances of "human righta" found.
## 7201 instances of "human rights" found.
## 3 instances of "human rights-based" found.
## 1 instances of "human rights-centred" found.
## 1 instances of "human rights-compliant" found.
## 1 instances of "human rights-focused" found.
## 3 instances of "human rights-respecting" found.
## 7 instances of "humanitarian rights" found.
## 1 instances of "humanity right" found.

Given a tokenized corpus of contexts, we first construct its corresponding document-feature matrix (DFM), align it with the pretrained embeddings, and compute ALC embeddings for each context. As a result, we obtain an ALC embedding for every instance of human* right* in the sample corpus. To derive a single, corpus-wide ALC embedding for “human* right*” we simply take the column-wise average of the individual instance embeddings.

We are, however, more interested in semantic differences across countries than in a single corpus-wide vector. We therefore average the context-level ALC embeddings to the country level, yielding one vector per country that describes its human-rights semantics in the UNSC.
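To make the two averaging steps concrete, here is a minimal base-R sketch with toy numbers. It assumes that the ALC embedding of one context is the mean of the pretrained vectors of its context words, multiplied by the transformation matrix. The 2-dimensional vectors, the matrix `A_toy`, and the country labels are invented for illustration and are not taken from glove.rds or khodakA.rds.

```r
# Toy pretrained embeddings: 4 words x 2 dimensions (made-up values)
pre_toy <- rbind(
  violations = c(1, 0),
  norms      = c(0, 1),
  protect    = c(1, 1),
  council    = c(0, 2)
)

# Toy transformation matrix (hypothetical stand-in for the real one)
A_toy <- diag(c(2, 0.5))

# Word counts for three contexts (rows) over the same vocabulary (columns)
counts <- rbind(
  ctx1 = c(violations = 1, norms = 1, protect = 0, council = 0),
  ctx2 = c(violations = 0, norms = 0, protect = 2, council = 0),
  ctx3 = c(violations = 0, norms = 1, protect = 0, council = 1)
)

# Step 1: average the pretrained vectors of the words in each context
ctx_mean <- (counts %*% pre_toy) / rowSums(counts)

# Step 2: apply the transformation matrix to get one ALC vector per context
alc <- ctx_mean %*% t(A_toy)

# Step 3: average the context vectors within each (hypothetical) country
country <- c("A", "A", "B")
alc_country <- rowsum(alc, group = country) / as.vector(table(country))
alc_country
```

In the real pipeline these steps are handled by `dem()` and `dem_group()`; the sketch only illustrates the arithmetic under the stated assumption.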

# --- 3) DFM of HR contexts; align with embedding vocab ---
dfm_hr <- dfm(toks_hr)
dfm_hr <- dfm_tolower(dfm_hr)
dfm_hr <- dfm_remove(dfm_hr, stopwords("en"))
dfm_hr <- dfm_trim(dfm_hr, min_termfreq = 5)  # drop very rare terms (tune)

# Keep only features that exist in the pretrained matrix
dfm_hr <- dfm_match(dfm_hr, features = rownames(pre_trained))

# Drop empty context-docs
dfm_hr <- dfm_subset(dfm_hr, ntoken(dfm_hr) > 0)
stopifnot(ndoc(dfm_hr) > 0)

# --- 4) ALC (à la carte) embeddings for each context ---
dem_hr <- dem(
  x                = dfm_hr,
  pre_trained      = pre_trained,
  transform        = TRUE,
  transform_matrix = transform_mat,
  verbose          = TRUE
)

# --- 5) Average to country-level HR vectors ---
# one vector per country describing its *human-rights* semantics
wv_country <- dem_group(dem_hr, groups = dem_hr@docvars$country_org)  # rows = countries

Now we have country-level ALC embeddings for “human rights”. Next, we can inspect the nearest neighbors of each country’s HR vector in the original embedding space to see how different countries discuss human rights in the UNSC.

This is possible because we have a pretrained word embedding matrix (GloVe) that covers a large vocabulary. By computing the cosine similarity between each country’s HR vector and every word in the pretrained space, we can identify the top-N nearest neighbors for each country. This gives us insight into the specific terms and concepts that are semantically associated with “human rights” for each country in the UNSC context.
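The nearest-neighbor lookup can be illustrated on a tiny example. The sketch below uses an invented four-word vocabulary with 2-dimensional vectors (not GloVe) and a hypothetical country vector `v`; it normalizes the rows so that a matrix product returns cosine similarities, then filters and ranks the words.

```r
# Toy vocabulary embeddings (made-up 2-d values, not GloVe)
W <- rbind(
  violations = c(3, 0),
  abuses     = c(2, 1),
  norms      = c(0, 2),
  the        = c(1, 1)
)

# L2-normalise each row so that a dot product equals cosine similarity
W_unit_toy <- W / sqrt(rowSums(W^2))

# Hypothetical country-level "human rights" vector, also scaled to unit length
v <- c(1, 0)
v <- v / sqrt(sum(v^2))

# Cosine similarity to every word, highest first
sims <- drop(W_unit_toy %*% v)
sims <- sort(sims, decreasing = TRUE)

# Drop unwanted terms (a one-word stand-in for a stopword list), keep the top 2
sims <- sims[!names(sims) %in% c("the")]
top2 <- head(sims, 2)
top2
```

Normalizing the vocabulary once and reusing it for every country avoids recomputing row norms inside the loop.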

# --- 6) Nearest neighbors per country (top 10) ---
# We'll compute cosine similarity between each country HR vector and ALL vocabulary words.

# normalize pretrained rows once for cosine
row_norms <- sqrt(rowSums(pre_trained^2))
W_unit <- pre_trained / row_norms
W_unit[!is.finite(W_unit)] <- 0

# helper to get top-N neighbors for one vector
top_nns <- function(vec, k = 10, exclude = c("human","right","rights")){
  v <- as.numeric(vec)
  v <- v / sqrt(sum(v * v))
  sims <- as.numeric(W_unit %*% v)                 # cosine
  names(sims) <- rownames(W_unit)
  sims <- sort(sims, decreasing = TRUE)

  # drop the target words and English stopwords
  bad <- unique(c(exclude, stopwords("en")))
  keep <- setdiff(names(sims), bad)
  sims[keep][seq_len(min(k, length(keep)))]
}

nn_tbl <- map_dfr(rownames(wv_country), function(cty) {
  sims <- top_nns(wv_country[cty, , drop = FALSE], k = 10)
  tibble(
    country_org = cty,
    rank = seq_along(sims),
    neighbor = names(sims),
    cosine = as.numeric(sims)
  )
})

# --- 7) Inspect / save ---
nn_tbl %>%
  arrange(country_org, rank)

## # A tibble: 50 × 4
##    country_org  rank neighbor      cosine
##    <chr>       <int> <chr>          <dbl>
##  1 China           1 safeguarding   0.502
##  2 China           2 humanitarian   0.478
##  3 China           3 peacebuilding  0.448
##  4 China           4 unsc           0.437
##  5 China           5 1244           0.431
##  6 China           6 post-conflict  0.428
##  7 China           7 resolutions    0.427
##  8 China           8 safeguard      0.427
##  9 China           9 ensuring       0.416
## 10 China          10 norms          0.415
## # ℹ 40 more rows

library(knitr)
library(kableExtra)

# assuming nn_tbl has: country_org, rank, neighbor, cosine

nn_summary <- nn_tbl %>%
  arrange(country_org, rank) %>%
  group_by(country_org) %>%
  summarise(
    top_words = paste0(
      sprintf("%d. %s", rank, neighbor),
      collapse = ", "
    ),
    .groups = "drop"
  )

# print nicely
kable(nn_summary, align = c("l", "l"), caption = "Top 10 nearest neighbors for 'human rights' by country") %>%
  kable_styling(full_width = FALSE, position = "center", font_size = 12)

Top 10 nearest neighbors for ‘human rights’ by country
country_org	top_words
China	1. safeguarding, 2. humanitarian, 3. peacebuilding, 4. unsc, 5. 1244, 6. post-conflict, 7. resolutions, 8. safeguard, 9. ensuring, 10. norms
France	1. violations, 2. norms, 3. respecting, 4. breaches, 5. ensuring, 6. violation, 7. safeguarding, 8. humanitarian, 9. flagrant, 10. abuses
Russian Federation	1. violations, 2. norms, 3. safeguarding, 4. humanitarian, 5. breaches, 6. respecting, 7. resolutions, 8. ensuring, 9. flagrant, 10. human-rights
United Kingdom Of Great Britain And Northern Ireland	1. violations, 2. abuses, 3. ensuring, 4. safeguarding, 5. norms, 6. respecting, 7. breaches, 8. flagrant, 9. impunity, 10. violation
United States Of America	1. violations, 2. abuses, 3. breaches, 4. respecting, 5. flagrant, 6. norms, 7. impunity, 8. safeguarding, 9. egregious, 10. ensuring

Temporal variation

To check the temporal variation in human-rights semantics, we can compute country-year ALC embeddings for “human rights” and analyze how they change over time. One way to visualize this is to calculate the cosine similarity of each country-year embedding to a “human rights” anchor vector, defined as the average embedding of all contexts mentioning “human rights” across the entire corpus.

The cosine similarity in the plot measures how closely a specific country-year’s meaning aligns with the global anchor. A higher similarity indicates that the country’s discourse on human rights in that year is more consistent with the overall global discourse on human rights.
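As a small worked example of this measure, the base-R sketch below builds an anchor from four invented instance vectors and compares each group mean to it. The vectors and the `country_year` keys are hypothetical and chosen only to keep the arithmetic easy to follow.

```r
# Toy instance-level embeddings: one row per context (made-up values)
inst <- rbind(c(1, 0), c(1, 1), c(0, 1), c(1, 0))
key  <- c("X_2000", "X_2000", "Y_2000", "Y_2001")  # hypothetical country_year keys

cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Global anchor: column-wise mean over all instances
anchor <- colMeans(inst)

# Mean vector per country-year (sum within group, divided by group size)
grp_mean <- rowsum(inst, key) / as.vector(table(key))

# Cosine similarity of each country-year mean to the anchor
sim_to_anchor <- apply(grp_mean, 1, cosine, b = anchor)
round(sim_to_anchor, 3)
```

Because the anchor is a mean over instances, groups that contribute more contexts pull it toward themselves, which is worth keeping in mind when reading the similarities.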

# --- 1) Average human-rights ALC embeddings per country-year ---
wv_cy <- dem_group(
  dem_hr,
  groups = paste(dem_hr@docvars$country_org, dem_hr@docvars$year, sep = "_")
)

cy_df <- tibble(
  cy  = rownames(wv_cy),
  U   = I(asplit(wv_cy, 1))
) |>
  tidyr::separate(cy, into = c("country_org", "year"), sep = "_", remove = FALSE) |>
  mutate(year = as.integer(year))

# helper functions
l2  <- function(v) sqrt(sum(v*v))
unit_vec <- function(v) v / l2(v)
cos_sim  <- function(a,b) sum(a*b) / (l2(a)*l2(b))

# average all context embeddings
hr_anchor <- unit_vec(colMeans(as.matrix(dem_hr)))

cy_df <- cy_df %>%
  mutate(
    hr_sim = map_dbl(U, ~ cos_sim(.x, hr_anchor))
  )

In the resulting plot, China and Russia appear to deviate from the global human-rights discourse more than the other countries, especially around certain conflict events, which are marked with dashed vertical lines.

conflict_events <- tibble::tribble(
  ~year, ~label,
  1994, "Rwanda",
  1999, "Kosovo",
  2003, "Iraq",
  2022, "Russia–Ukraine"
)


ggplot(cy_df, aes(x = year, y = hr_sim, color = country_org)) +
  geom_line(linewidth = 1) +
  geom_point(size = 1.6) +
  labs(
    title = "Semantic alignment with the global human-rights anchor",
    subtitle = "Cosine similarity of country–year ALC embeddings to the corpus-wide 'human rights' centroid",
    x = "Year", y = "Cosine similarity"
  ) +
  theme_minimal() +
  theme(legend.title = element_blank()) +
  # dashed vertical lines mark the conflict years defined above
  geom_vline(data = conflict_events, aes(xintercept = year),
             linetype = "dashed", alpha = 0.6)

UMAP

The next plot shows a UMAP projection of the country–year embeddings for “human rights” discourse. Each point represents the ALC embedding of a given country and year, with color indicating the country and labels showing the year. UMAP reduces the high-dimensional vectors to two dimensions using cosine distance, so points that appear close together tend to reflect similar meanings of “human rights” across countries or years. Because UMAP mainly preserves local neighborhoods, clusters suggest shared or converging discourses, while distances between far-apart points are only a rough indication of divergent national interpretations over time.

library(uwot)
emb_mat <- do.call(rbind, cy_df$U)
rownames(emb_mat) <- paste(cy_df$country_org, cy_df$year, sep = "_")
set.seed(42)
um <- umap(
  emb_mat,
  n_neighbors = 10,
  min_dist    = 0.2,
  metric      = "cosine"
)

cy_df$UMAP1 <- um[,1]
cy_df$UMAP2 <- um[,2]


ggplot(cy_df, aes(x = UMAP1, y = UMAP2, color = country_org)) +
  #geom_path(aes(group = country_org), linewidth = 1, alpha = 0.7) +
  geom_point(size = 2, alpha = 0.9) +
  ggrepel::geom_text_repel(
    aes(label = year),
    size = 3,
    max.overlaps = 40,
    show.legend = FALSE
  ) +
  #facet_wrap(~ country_org, scales = "free") +
  labs(
    title = "UMAP projection of country–year human-rights embeddings",
    subtitle = "Each point = ALC embedding of 'human rights' discourse in that country–year",
    x = "UMAP1",
    y = "UMAP2",
    color = "Country"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.title = element_blank(),
    panel.grid.minor = element_blank()
  )

Sentence Embedding

Here I use a pretrained sentence-embedding model to compute embeddings for all sentences mentioning “human rights” in the UNSC speeches. I then average these sentence embeddings to the country-year level to obtain a representation of how each country discusses human rights over time. Finally, I analyze the semantic alignment of these country-year embeddings with a curated set of human-rights seed phrases, as well as with the overall global discourse on human rights.

library(reticulate)

setup_embed_env <- function(
  envname = "hr_embed_env",
  pkgs = c("sentence-transformers","numpy","pandas","tqdm"),
  model_name = "sentence-transformers/all-MiniLM-L6-v2"
) {
  if (!envname %in% virtualenv_list()) {
    virtualenv_create(envname)
  }
  py_install(packages = pkgs, envname = envname, pip = TRUE)
  use_virtualenv(envname, required = TRUE)

  st <- import("sentence_transformers", convert = TRUE)
  model <- st$SentenceTransformer(model_name)

  list(
    model = model,
    embed_dim = as.integer(model$get_sentence_embedding_dimension())
  )
}

env <- setup_embed_env()

## Using virtual environment 'hr_embed_env' ...

model <- env$model
embed_dim <- env$embed_dim

The goal of this section is to transform raw sentences mentioning “human rights” into numerical vector representations (embeddings) using a pretrained sentence-transformer model (like all-MiniLM-L6-v2).

These embeddings capture semantic meaning — so that sentences with similar ideas end up close together in high-dimensional space.

encode_texts <- function(texts,
                         batch_size = 32,
                         normalize = TRUE) {
  if (length(texts) == 0) {
    return(matrix(numeric(0), nrow = 0, ncol = embed_dim))
  }
  embs <- model$encode(
    texts,
    batch_size = as.integer(batch_size),
    show_progress_bar = FALSE,
    normalize_embeddings = normalize
  )
  embs_mat <- if (is.null(dim(embs))) {
    matrix(as.numeric(embs), nrow = 1)
  } else {
    as.matrix(embs)
  }
  embs_mat
}

l2       <- function(v) sqrt(sum(v * v))
unit_vec <- function(v) if (all(v == 0)) v else v / l2(v)
cos_sim  <- function(a, b) {
  da <- l2(a); db <- l2(b)
  if (da == 0 || db == 0) return(NA_real_)
  sum(a * b) / (da * db)
}

hr_docs <- merged_df %>%
  filter(str_detect(content, regex("human right", ignore_case = TRUE)))

hr_sentences <- hr_docs %>%
  mutate(sentences = map(content, ~ tokenizers::tokenize_sentences(.x)[[1]])) %>%
  select(country_org, year, sentences) %>%
  unnest(sentences) %>%
  rename(sentence = sentences) %>%
  filter(str_detect(sentence, regex("human right", ignore_case = TRUE))) %>%
  mutate(sentence = str_squish(sentence))

# if this is empty for some country/years, they'll just drop out downstream

# vector of all sentences
all_sents <- hr_sentences$sentence

all_embs_mat <- encode_texts(all_sents, batch_size = 64, normalize = TRUE)

# attach embeddings back: one numeric vector (matrix row) per sentence
hr_sentences$embedding <- lapply(
  seq_len(nrow(all_embs_mat)),
  function(i) all_embs_mat[i, ]
)

Here I take all the “human rights” sentences for a given country and year and average their embeddings. This yields one mean vector per country–year: the semantic centroid of how that country framed human rights in that year. Averaging smooths out individual-sentence noise, which makes it possible to track broad rhetorical or ideological trends over time.

I then create two types of reference embeddings to serve as comparison points. The first is a seed anchor (hr_anchor_seed), computed from a small, curated set of canonical human-rights phrases (such as “freedom from torture” and “rule of law”). It represents a normative, idealized concept of human rights, grounded in international legal and rights-based language.

The second is a global anchor (hr_anchor_global), which follows the same logic as the ALC anchor used earlier: it is the average embedding across all “human rights” sentences in the corpus. It represents the empirical center of how “human rights” is discussed globally, that is, the data-driven “mainstream” meaning.
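The difference between the two reference points can be sketched in base R with invented vectors. Nothing here comes from the sentence-transformer model: the rows of `sent` and `seed` are hypothetical 3-dimensional stand-ins for sentence and seed-phrase embeddings.

```r
unit <- function(v) v / sqrt(sum(v^2))

# Made-up unit-length "sentence embeddings" (rows), not real model output
sent <- rbind(unit(c(1, 0, 0)), unit(c(1, 1, 0)), unit(c(0, 1, 1)))
# Made-up embeddings of two hypothetical seed phrases
seed <- rbind(unit(c(1, 0, 0)), unit(c(1, 0, 1)))

# Seed anchor: mean of the seed-phrase embeddings, rescaled to unit length
anchor_seed <- unit(colMeans(seed))
# Global anchor: mean of all corpus sentence embeddings, rescaled likewise
anchor_global <- unit(colMeans(sent))

# A country-year vector is the mean of its sentences (here: sentences 1 and 2)
cy_vec <- colMeans(sent[1:2, , drop = FALSE])

# Compare the same country-year vector against both anchors
sims_two <- c(
  seed   = sum(unit(cy_vec) * anchor_seed),
  global = sum(unit(cy_vec) * anchor_global)
)
sims_two
```

The same country-year vector can score differently on the two anchors, since one anchor is fixed by the chosen phrases and the other moves with the corpus.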

country_year_hr <- hr_sentences %>%
  group_by(country_org, year) %>%
  summarise(
    hr_embedding = list(colMeans(do.call(rbind, embedding))),
    .groups = "drop"
  )

hr_seed_phrases <- c(
  "human rights",
  "international human rights law",
  "civil and political rights",
  "economic, social, and cultural rights",
  "protection of human rights defenders",
  "freedom from torture",
  "freedom of expression",
  "rule of law and accountability for violations",
  "non-discrimination and equality before the law"
)

hr_seed_embs <- encode_texts(hr_seed_phrases, batch_size = 16, normalize = TRUE)
hr_anchor_seed <- unit_vec(colMeans(hr_seed_embs))
hr_anchor_global <- unit_vec(colMeans(all_embs_mat))

country_year_hr <- country_year_hr %>%
  mutate(
    hr_sim_seed   = sapply(hr_embedding, function(u) cos_sim(u, hr_anchor_seed)),
    hr_sim_global = sapply(hr_embedding, function(u) cos_sim(u, hr_anchor_global)))

library(ggplot2)

ggplot(country_year_hr, aes(x = year, y = hr_sim_seed, color = country_org)) +
  geom_line(linewidth = 1) +
  geom_point(size = 1.8) +
  labs(
    title = "Semantic alignment with the human-rights seed anchor",
    subtitle = "Cosine similarity of country–year HR embeddings to curated HR anchor",
    x = "Year",
    y = "Cosine similarity",
    color = "Country"
  ) +
  theme_minimal() +
  theme(legend.title = element_blank())

ggplot(country_year_hr, aes(year, hr_sim_global, color = country_org)) +
  geom_line() + geom_point() +
  labs(
    title = "Alignment with global HR discourse (alternative anchor)",
    x = "Year", y = "Cosine similarity"
  ) +
  theme_minimal() +
  theme(legend.title = element_blank())

library(uwot)

emb_mat_cy <- do.call(rbind, country_year_hr$hr_embedding)
rownames(emb_mat_cy) <- paste(country_year_hr$country_org, country_year_hr$year, sep = "_")

set.seed(123)
um <- umap(
  emb_mat_cy,
  n_neighbors = 10,
  min_dist = 0.2,
  metric = "cosine"
)

country_year_hr$UMAP1 <- um[,1]
country_year_hr$UMAP2 <- um[,2]

ggplot(country_year_hr, aes(UMAP1, UMAP2, color = country_org)) +
  #geom_path(aes(group = country_org), linewidth = 1, alpha = 0.7) +
  geom_point(size = 2, alpha = 0.9) +
  ggrepel::geom_text_repel(aes(label = year), size = 3, max.overlaps = 40) +
  theme_minimal() +
  labs(
    title = "UMAP projection of country–year 'human rights' embeddings",
    subtitle = "Sentence-transformer embeddings; labels show the year",
    x = "UMAP1", y = "UMAP2",
    color = "Country"
  )

To characterize the semantic field of “human rights” for each country, we computed the ten nearest lexical neighbors to each country’s mean human-rights embedding. These terms represent the vocabulary most semantically aligned with each country’s human-rights discourse, allowing qualitative interpretation of differences in framing across countries.

library(quanteda)

tokens_hr <- tokens(
  tolower(hr_sentences$sentence),
  remove_punct = TRUE,
  remove_numbers = TRUE
) %>%
  tokens_select(pattern = stopwords("en"), selection = "remove")

vocab <- unique(unlist(tokens_hr))
vocab <- vocab[nchar(vocab) > 3]

word_embs <- encode_texts(vocab, batch_size = 64, normalize = TRUE)
rownames(word_embs) <- vocab

get_top_words <- function(vec, k = 10) {
  v <- unit_vec(vec)
  sims <- as.numeric(word_embs %*% v)
  names(sims) <- rownames(word_embs)
  drop_terms <- c("human","right","rights","humanrights","human-rights")
  sims <- sims[!names(sims) %in% drop_terms]
  sims <- sort(sims, decreasing = TRUE)
  head(sims, k)
}

nn_tbl <- country_year_hr %>%
  group_by(country_org) %>%
  summarise(
    # mean of the country's country-year vectors, so every year contributes
    top_words = list(names(get_top_words(colMeans(do.call(rbind, hr_embedding)), k = 10))),
    .groups = "drop"
  )

library(kableExtra)

nn_tbl %>%
  mutate(top_words = sapply(top_words, function(w) paste(w, collapse = ", "))) %>%
  kable(
    caption = "Top 10 nearest terms to each country's human-rights embedding (sentence-embedding space)"
  ) %>%
  kable_styling(full_width = FALSE, position = "center", font_size = 11)

Top 10 nearest terms to each country’s human-rights embedding (sentence-embedding space)
country_org	top_words
China	rights-compliant, rights-based, rights-respecting, rights-centred, non-peacekeeping, rights-focused, protection-of-civilians, peacekeeping, sovereignty, nations-mandated
France	peacekeeping, non-peacekeeping, protection-of-civilians, humanitarian, detainees, humanitarians, humanitarianism, rights-compliant, anti-genocide, civilians
Russian Federation	protection-of-civilians, peacekeeping, rights-respecting, non-peacekeeping, protection-of-civilian, inhumanity, rights-compliant, detainees, rights-centred, humanitarian
United Kingdom Of Great Britain And Northern Ireland	peacekeeping, protection-of-civilians, non-peacekeeping, rights-respecting, rights-compliant, peace-keeping, protection-of-civilian, detainees, rights-centred, freedoms
United States Of America	genocide, humanitarianism, anti-genocide, atrocities, detainees, humanitarian, peacekeeping, dictatorship, non-peacekeeping, humanitarians

Embedding procedures

We employed two complementary embedding approaches to analyze the semantic framing of “human rights” across United Nations Security Council speeches.

First, we applied à la carte (ALC) word embeddings following Rodríguez et al. (2023), which combine pretrained static embeddings (here GloVe) with a pretrained transformation matrix (here khodakA.rds) to derive context-specific word vectors from the words surrounding each mention. ALC embeddings capture micro-level shifts in the meaning of specific terms, in this case how the concept “human rights” varies across countries and over time.

Second, we used sentence embeddings generated with the pretrained transformer model all-MiniLM-L6-v2 (via the sentence-transformers library). Sentence embeddings capture the contextual meaning of full sentences, allowing us to summarize each country–year’s discourse on human rights as a high-dimensional vector and to measure its alignment with a global human-rights anchor using cosine similarity. This method provides a macro-level view of cross-national and temporal variation in human-rights discourse.

UNSC-semantic check

Winnie

2025-11-07