ALC word embeddings

Calculate embeddings for UNSC speeches on “human rights”

Here, I begin by identifying all contexts in the UNSC speeches that mention “human rights” (or variants thereof). I then compute à la carte (ALC) embeddings for these contexts, and average them to the country level. Finally, I inspect the nearest neighbors of each country’s “human rights” embedding to see how different countries discuss human rights in the UNSC.

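# Packages used throughout this document
library(quanteda)            # corpus(), tokens(), dfm(), stopwords()
library(quanteda.textstats)  # textstat_keyness()
library(conText)             # tokens_context(), dem(), dem_group(), conText()
library(dplyr)
library(tidyr)
library(purrr)
library(tibble)
library(stringr)
library(tokenizers)          # tokenize_sentences()
library(ggplot2)

setwd("~/Desktop/Aarhus_Y1/16_UnText/1_UN_data/security_1992:2023")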

# Data and resources
merged_df    <- read.csv("merged_UNSC_meta.csv")
pre_trained  <- readRDS("glove.rds")
transform_mat <- readRDS("khodakA.rds")

# Tunable parameters
context_window <- 6   # ± window around "human rights"
min_termfreq   <- 5   # minimum term frequency in HR contexts


## ----------------------------------------------------------
## 1. Build corpus with document variables
## ----------------------------------------------------------

corp <- corpus(merged_df, text_field = "content")

docvars(corp, "country_org") <- merged_df$country_org
docvars(corp, "year")        <- merged_df$year


## ----------------------------------------------------------
## 2. Tokenize text (standard preprocessing)
## ----------------------------------------------------------

toks <- tokens(
  corp,
  remove_punct      = TRUE,
  remove_symbols    = TRUE,
  remove_numbers    = TRUE,
  remove_separators = TRUE
) |>
  tokens_tolower()

Given a tokenized corpus of contexts, we first construct its corresponding document-feature matrix (DFM), align it with the pretrained embeddings, and compute ALC embeddings for each context. As a result, we obtain an ALC embedding for every instance of human* right* in the sample corpus. To derive a single, corpus-wide ALC embedding for “human* right*” we simply take the column-wise average of the individual instance embeddings.

We are, however, mainly interested in semantic differences between countries. We therefore further average the context-level ALC embeddings to the country level, yielding one vector per country that describes its human-rights semantics in the UNSC.
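
To make the averaging and transformation steps concrete, here is a minimal base-R sketch of what an ALC embedding amounts to. Everything in it is a toy assumption: the vocabulary, the random vectors in `toy_pre_trained`, the stand-in matrix `toy_transform`, and the helper `alc_embed()` are hypothetical and are not part of conText. It assumes the usual convention that the ALC embedding is the transformation matrix applied to the average of the context word vectors.

```r
# Toy ALC computation in base R (all values are made up for illustration)
set.seed(1)
d     <- 4                                   # hypothetical embedding dimension
vocab <- c("violations", "abuses", "norms", "respect", "law")
toy_pre_trained <- matrix(rnorm(length(vocab) * d), nrow = length(vocab),
                          dimnames = list(vocab, NULL))
toy_transform   <- diag(d) + 0.1             # stand-in for the transformation matrix

# Two contexts around the target phrase, each a bag of context words
contexts <- list(
  c("violations", "abuses", "law"),
  c("norms", "respect", "law")
)

# ALC embedding of one context: average the context word vectors, then transform
alc_embed <- function(words, pre_trained, A) {
  words <- words[words %in% rownames(pre_trained)]   # drop out-of-vocabulary words
  u_bar <- colMeans(pre_trained[words, , drop = FALSE])
  as.numeric(A %*% u_bar)
}

# One row per instance of the target phrase
instance_embs <- t(sapply(contexts, alc_embed,
                          pre_trained = toy_pre_trained, A = toy_transform))

# Corpus-wide embedding: column-wise average over the instance embeddings
corpus_emb <- colMeans(instance_embs)
```

Because both steps are linear, averaging the instance embeddings gives the same vector as transforming the average of the per-context means. Grouping to countries works the same way, only with the average taken within each group.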

## ----------------------------------------------------------
## 3. Extract contexts around “human rights”
##    (ALC on local contexts only)
## ----------------------------------------------------------

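# NOTE:
# - tokens() keeps hyphenated words intact by default (split_hyphens = FALSE),
#   so a single token such as "human-rights" is NOT matched by this two-token pattern.
# - The wildcard pattern "human* right*" matches variants, including a few
#   false positives (e.g. "humanitarian rights", "humanity right"; see counts below).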
dfm_hr <- toks |>
  tokens_context(
    pattern = c("human* right*"),
    window  = context_window
  ) |>
  dfm() |>
  dfm_tolower() |>
  dfm_remove(stopwords("en")) |>
  dfm_trim(min_termfreq = min_termfreq)

## 18 instances of "human right" found.
## 1 instances of "human righta" found.
## 7201 instances of "human rights" found.
## 3 instances of "human rights-based" found.
## 1 instances of "human rights-centred" found.
## 1 instances of "human rights-compliant" found.
## 1 instances of "human rights-focused" found.
## 3 instances of "human rights-respecting" found.
## 7 instances of "humanitarian rights" found.
## 1 instances of "humanity right" found.

# Keep only terms that have embeddings available
dfm_hr <- dfm_match(dfm_hr, features = rownames(pre_trained))

# Drop empty context-docs (no remaining tokens)
dfm_hr <- dfm_subset(dfm_hr, ntoken(dfm_hr) > 0)
stopifnot(ndoc(dfm_hr) > 0)


## ----------------------------------------------------------
## 4. ALC (à la carte) embeddings for each HR context
## ----------------------------------------------------------

dem_hr <- dem(
  x                = dfm_hr,
  pre_trained      = pre_trained,
  transform        = TRUE,
  transform_matrix = transform_mat,
  verbose          = TRUE
)


## ----------------------------------------------------------
## 5. Average to country-level human-rights vectors
##    (one HR embedding per country)
## ----------------------------------------------------------

wv_country <- dem_group(
  dem_hr,
  groups = dem_hr@docvars$country_org
)

We now have country-level ALC embeddings for “human rights”. Next, we inspect the nearest neighbors of each country’s HR vector in the original embedding space to see how different countries discuss human rights in the UNSC.

This is possible because the ALC embeddings live in the same space as the pretrained word embedding matrix (GloVe), which covers a large vocabulary. By computing the cosine similarity between each country’s HR vector and every word in the pretrained space, we can identify the top-N nearest neighbors for each country. These are the terms and concepts most closely associated with “human rights” in each country’s UNSC speeches.

## ----------------------------------------------------------
## 6. Nearest-neighbor words for each country HR embedding
##    (which words are closest to each country’s HR vector?)
## ----------------------------------------------------------

# Tunable: how many nearest neighbors per country
k_top <- 10

# 6.1 Normalize pretrained embeddings row-wise (for cosine similarity)
#     W_unit[w, ] is the unit vector for word w.
row_norms <- sqrt(rowSums(pre_trained^2))
W_unit <- pre_trained / row_norms
W_unit[!is.finite(W_unit)] <- 0   # guard against division by zero / NaN


# 6.2 Helper: top-N nearest neighbors for a SINGLE vector
top_nns <- function(vec,
                    k       = k_top,
                    exclude = c("human", "right", "rights")) {
  
  # normalize input vector
  v <- as.numeric(vec)
  v <- v / sqrt(sum(v * v))
  
  # cosine similarity against all vocabulary words
  sims <- as.numeric(W_unit %*% v)
  names(sims) <- rownames(W_unit)
  
  # sort high to low
  sims <- sort(sims, decreasing = TRUE)
  
  # drop target words, stopwords, etc.
  bad  <- unique(c(exclude, stopwords("en")))
  keep <- setdiff(names(sims), bad)
  
  # return top-k
  sims[keep][seq_len(min(k, length(keep)))]
}


# 6.3 Apply to all countries: one table of neighbors
nn_tbl <- rownames(wv_country) |>
  map_dfr(function(cty) {
    sims <- top_nns(wv_country[cty, , drop = FALSE])
    tibble(
      country_org = cty,
      rank        = seq_along(sims),
      neighbor    = names(sims),
      cosine      = as.numeric(sims)
    )
  })


## ----------------------------------------------------------
## 7. Inspect / save NN table
## ----------------------------------------------------------

nn_tbl <- nn_tbl |>
  arrange(country_org, rank)
nn_tbl

## # A tibble: 50 × 4
##    country_org  rank neighbor      cosine
##    <chr>       <int> <chr>          <dbl>
##  1 China           1 safeguarding   0.502
##  2 China           2 humanitarian   0.478
##  3 China           3 peacebuilding  0.448
##  4 China           4 unsc           0.437
##  5 China           5 1244           0.431
##  6 China           6 post-conflict  0.428
##  7 China           7 resolutions    0.427
##  8 China           8 safeguard      0.427
##  9 China           9 ensuring       0.416
## 10 China          10 norms          0.415
## # ℹ 40 more rows

## ----------------------------------------------------------
## 8. Summarise nearest-neighbor words per country
## ----------------------------------------------------------

library(knitr)
library(kableExtra)


# nn_tbl contains: country_org, rank, neighbor, cosine

nn_summary <- nn_tbl %>%
  arrange(country_org, rank) %>%
  group_by(country_org) %>%
  summarise(
    top_words = paste(
      sprintf("%d. %s", rank, neighbor),
      collapse = ", "
    ),
    .groups = "drop"
  )

## ----------------------------------------------------------
## 9. Print summary table nicely
## ----------------------------------------------------------

kable(
  nn_summary,
  align   = c("l", "l"),
  caption = "Top 10 nearest-neighbor words to each country's human-rights embedding"
) %>%
  kable_styling(
    full_width = FALSE,
    position   = "center",
    font_size  = 12
  )

Top 10 nearest-neighbor words to each country’s human-rights embedding
country_org	top_words
China	1. safeguarding, 2. humanitarian, 3. peacebuilding, 4. unsc, 5. 1244, 6. post-conflict, 7. resolutions, 8. safeguard, 9. ensuring, 10. norms
France	1. violations, 2. norms, 3. respecting, 4. breaches, 5. ensuring, 6. violation, 7. safeguarding, 8. humanitarian, 9. flagrant, 10. abuses
Russian Federation	1. violations, 2. norms, 3. safeguarding, 4. humanitarian, 5. breaches, 6. respecting, 7. resolutions, 8. ensuring, 9. flagrant, 10. human-rights
United Kingdom Of Great Britain And Northern Ireland	1. violations, 2. abuses, 3. ensuring, 4. safeguarding, 5. norms, 6. respecting, 7. breaches, 8. flagrant, 9. impunity, 10. violation
United States Of America	1. violations, 2. abuses, 3. breaches, 4. respecting, 5. flagrant, 6. norms, 7. impunity, 8. safeguarding, 9. egregious, 10. ensuring

Temporal variation

To check the temporal variation in human-rights semantics, we can compute country-year level ALC embeddings for “human rights” and then analyze how these embeddings change over time. One way to visualize this is by calculating the cosine similarity of each country-year embedding to the “human rights” anchor vector, which has been derived from the average embedding of all contexts mentioning “human rights” across the entire corpus.

The cosine similarity in the plot measures how closely a specific country-year’s meaning aligns with the global anchor. A higher similarity indicates that the country’s discourse on human rights in that year is more consistent with the overall global discourse on human rights.
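
As a small worked illustration of this measure, the sketch below computes cosine similarities to a centroid anchor for three made-up vectors. The labels, the numbers, and the helper `cos_sim_toy()` are hypothetical; real ALC embeddings have hundreds of dimensions.

```r
# Cosine similarity between two numeric vectors
cos_sim_toy <- function(a, b) sum(a * b) / (sqrt(sum(a * a)) * sqrt(sum(b * b)))

# Hypothetical 3-dimensional "country-year embeddings"
toy_cy <- rbind(
  A_2000 = c(1.0, 0.2, 0.0),
  A_2001 = c(0.9, 0.3, 0.1),
  B_2000 = c(0.1, 1.0, 0.4)
)

# Anchor: centroid of all vectors, analogous to the corpus-wide average
toy_anchor <- colMeans(toy_cy)

# Similarity of each row to the anchor
toy_sims <- apply(toy_cy, 1, cos_sim_toy, b = toy_anchor)
round(toy_sims, 3)
```

Two properties matter for reading the plot. Cosine similarity ignores vector length, so only the direction of an embedding counts. And because the anchor is an average, whichever group contributes most of the vectors (here the two A rows) pulls the anchor towards itself and therefore looks more "aligned" almost by construction.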

## ----------------------------------------------------------
## X. Country–year HR embeddings & similarity to global anchor
## ----------------------------------------------------------

# 1) Group to country–year ALC embeddings
wv_cy <- dem_group(
  dem_hr,
  groups = paste(dem_hr@docvars$country_org, dem_hr@docvars$year, sep = "_")
)

cy_df <- tibble(
  cy = rownames(wv_cy),
  U  = I(asplit(wv_cy, 1))          # list-column of embedding vectors
) |>
  tidyr::separate(cy, into = c("country_org", "year"), sep = "_", remove = FALSE) |>
  mutate(year = as.integer(year))

# 2) Helper functions for norms & cosine similarity
l2       <- function(v) sqrt(sum(v * v))
unit_vec <- function(v) v / l2(v)
cos_sim  <- function(a, b) sum(a * b) / (l2(a) * l2(b))


# 3) Global HR anchor: average over all HR contexts
hr_anchor <- dem_hr |>
  as.matrix() |>
  colMeans() |>
  unit_vec()


# 4) Cosine similarity of each country–year vector to the HR anchor
cy_df <- cy_df |>
  mutate(
    hr_sim = map_dbl(U, ~ cos_sim(.x, hr_anchor))
  )

The resulting plot shows that China and Russia deviate from the global human-rights discourse more than the other countries, especially around certain conflict events.

# 5) Key conflict events to mark on the time axis
conflict_events <- tibble::tribble(
  ~year, ~label,
  1994L, "Rwanda",
  1999L, "Kosovo",
  2003L, "Iraq",
  2022L, "Russia–Ukraine"
)


# 6) Plot: HR similarity over time, by country, with conflict markers
ggplot(cy_df, aes(x = year, y = hr_sim, color = country_org, group = country_org)) +
  geom_line(linewidth = 1) +
  geom_point(size = 1.6) +
  geom_vline(
    data        = conflict_events,
    aes(xintercept = year),
    linetype    = "dashed",
    alpha       = 0.6,
    inherit.aes = FALSE
  ) +
  labs(
    title    = "Semantic alignment with the global human-rights anchor",
    subtitle = "Cosine similarity of country–year ALC embeddings to the corpus-wide 'human rights' centroid",
    x        = "Year",
    y        = "Cosine similarity"
  ) +
  theme_bw() +
  theme(
    legend.title      = element_blank(),
    panel.grid.minor  = element_blank()
  )

Using all UNSC meetings as the anchor

We now make a similar plot, but use the average embedding across all UNSC speeches as the anchor, rather than only the “human rights” contexts. This gives a different perspective on how each country’s human-rights discourse aligns with the overall UNSC discourse each year.

## ----------------------------------------------------------
## X. Distance to external HR anchor (avg_embedding_hr)
## ----------------------------------------------------------

avg_embedding_hr <- readRDS("/Users/au760950/Desktop/Aarhus_Y1/16_UnText/1_UN_data/security_1992:2023/avg_embedding_hr.rds")

anchor <- avg_embedding_hr

stopifnot(length(anchor) == length(cy_df$U[[1]]))  # sanity check


# 2) Cosine similarity / distance to this anchor
cos_sim_anchor <- function(v) cos_sim(v, anchor)

cy_df_full <- cy_df |>
  mutate(
    cos_sim = map_dbl(U, cos_sim_anchor),  # similarity to HR anchor
    dist    = 1 - cos_sim                  # cosine "distance"
  )




## ----------------------------------------------------------
## Y. Plot: country–year distance to HR anchor over time
## ----------------------------------------------------------

ggplot(cy_df_full, aes(x = year, y = dist, color = country_org, group = country_org)) +
  geom_line(linewidth = 1) +
  geom_point(size = 1.6) +
  geom_vline(
    data        = conflict_events,
    aes(xintercept = year),
    linetype    = "dashed",
    alpha       = 0.6,
    inherit.aes = FALSE
  ) +
  labs(
    title  = "Country-year distance to average human-rights embedding",
    x      = "Year",
    y      = "Cosine distance to HR anchor",
    color  = "Country"
  ) +
  theme_bw() +
  theme(
    legend.title     = element_blank(),
    panel.grid.minor = element_blank()
  )

UMAP

The plot below shows a UMAP projection of country–year embeddings for “human rights” discourse. Each point represents the ALC embedding of a given country and year, with color indicating the country and labels showing the year. UMAP reduces the high-dimensional embeddings to two dimensions using cosine distance, so points that appear closer together tend to reflect more similar meanings of “human rights” across countries or years. Clusters suggest shared or converging discourses, while more distant points indicate divergent national interpretations over time.

library(uwot)

# 1) Build embedding matrix from list-column U
emb_mat <- cy_df$U |>
  do.call(what = rbind)

rownames(emb_mat) <- paste(cy_df$country_org, cy_df$year, sep = "_")

# 2) Run UMAP in cosine space
set.seed(1001)

um <- umap(
  emb_mat,
  n_neighbors = 10,
  min_dist    = 0.2,
  metric      = "cosine"
  # ret_model = TRUE  # <-- turn this on later if you want to add anchor points
)

# 3) Attach UMAP coordinates back to cy_df
cy_df <- cy_df |>
  mutate(
    UMAP1 = um[, 1],
    UMAP2 = um[, 2]
  )

# 4) Plot: country–year trajectories in UMAP space
ggplot(cy_df, aes(x = UMAP1, y = UMAP2, color = country_org)) +
  # geom_path(aes(group = country_org), linewidth = 1, alpha = 0.7) +  # uncomment for trajectories
  geom_point(size = 2, alpha = 0.9) +
  ggrepel::geom_text_repel(
    aes(label = year),
    size         = 3,
    max.overlaps = 40,
    show.legend  = FALSE
  ) +
  labs(
    title    = "UMAP projection of country–year human-rights embeddings",
    subtitle = "Each point = ALC embedding of 'human rights' discourse in that country–year",
    x        = "UMAP1",
    y        = "UMAP2",
    color    = "Country"
  ) +
  theme_bw(base_size = 12) +
  theme(
    legend.title     = element_blank(),
    panel.grid.minor = element_blank()
  )

## 1) Define semantic anchor term sets
anchor_terms <- list(
  # Individual rights & freedoms (commented out if you don't want it plotted)
  # freedom = c(
  #   "freedom", "liberty", "autonomy", "rights", "consent", "choice"
  # ),
  
  democracy = c(
    "democracy", "democratic", "elections", "parliament",
    "vote", "voting", "accountability", "representation", "pluralism"
  ),
  
  justice = c(
    "justice", "rule", "law", "fairness", "due", "process",
    "accountability", "oversight", "transparency"
  ),
  
  equality = c(
    "equality", "equal", "nondiscrimination", "inclusion",
    "diversity", "equitable", "fairness"
  ),
  
  civil_liberties = c(
    "speech", "expression", "assembly", "association",
    "press", "privacy", "religion", "conscience"
  ),
  
  sovereignty = c(
    "sovereignty", "independence", "territorial", "integrity",
    "self", "determination", "noninterference", "jurisdiction"
  ),
  
  authoritarian = c(
    "control", "censorship", "surveillance", "coercion",
    "repression", "authority", "dominance"
  ),
  
  welfare = c(
    "health", "education", "housing", "employment",
    "poverty", "development", "welfare", "social", "services"
  ),
  
  international_law = c(
    "treaty", "convention", "charter", "norms",
    "obligations", "ratification", "monitoring", "compliance"
  )
)


## 2) Helper to compute phrase embedding = mean of word vectors
get_phrase_embedding <- function(words, emb_mat) {
  words <- intersect(words, rownames(emb_mat))
  if (length(words) == 0L) {
    return(rep(NA_real_, ncol(emb_mat)))
  }
  emb <- emb_mat[words, , drop = FALSE]
  colMeans(emb)
}

# raw word-space embeddings for anchors (same space as pre_trained)
anchor_raw_mat <- t(
  vapply(
    anchor_terms,
    get_phrase_embedding,
    numeric(ncol(pre_trained)),
    emb_mat = pre_trained
  )
)

# The ALC transform (transform_mat) maps averaged context vectors back into the
# pretrained word-embedding space, so the country-year embeddings U are already
# comparable to raw GloVe vectors (the nearest-neighbor step above relies on this).
# The anchors are therefore left untransformed.

anchor_mat <- anchor_raw_mat

## 3) Build embedding matrix for country–year points
emb_mat <- cy_df$U |>
  do.call(what = rbind)

rownames(emb_mat) <- paste(cy_df$country_org, cy_df$year, sep = "_")


## 4) Fit UMAP model and project country–year embeddings
set.seed(1001)
um_model <- umap(
  emb_mat,
  n_neighbors = 10,
  min_dist    = 0.2,
  metric      = "cosine",
  ret_model   = TRUE
)

um_coords <- um_model$embedding

cy_df <- cy_df |>
  mutate(
    UMAP1 = um_coords[, 1],
    UMAP2 = um_coords[, 2]
  )


## 5) Project anchor concept embeddings into the same UMAP space
# Drop anchors that ended up as all-NA (no vocabulary overlap)
valid_anchor <- apply(anchor_mat, 1, function(row) all(is.finite(row)))
anchor_mat_use <- anchor_mat[valid_anchor, , drop = FALSE]

anchor_umap <- umap_transform(anchor_mat_use, um_model)

anchor_df <- tibble(
  label = rownames(anchor_mat_use),
  UMAP1 = anchor_umap[, 1],
  UMAP2 = anchor_umap[, 2]
)


## 6) Plot: country–year points + semantic anchors
ggplot(cy_df, aes(x = UMAP1, y = UMAP2, color = country_org)) +
  geom_point(size = 2, alpha = 0.8) +
  ggrepel::geom_text_repel(
    aes(label = year),
    size         = 3,
    max.overlaps = 40,
    show.legend  = FALSE
  ) +
  # anchor points (black triangles)
  geom_point(
    data        = anchor_df,
    aes(x = UMAP1, y = UMAP2),
    inherit.aes = FALSE,
    size        = 3.5,
    shape       = 17
  ) +
  ggrepel::geom_label_repel(
    data        = anchor_df,
    aes(x = UMAP1, y = UMAP2, label = label),
    inherit.aes = FALSE,
    size        = 3,
    label.size  = 0.2,
    fill        = "white"
  ) +
  labs(
    title    = "UMAP projection of country–year human-rights embeddings",
    subtitle = "Anchors show positions of key concepts in the same semantic space",
    x        = "UMAP1",
    y        = "UMAP2",
    color    = "Country"
  ) +
  theme_bw(base_size = 12) +
  theme(
    legend.title     = element_blank(),
    panel.grid.minor = element_blank()
  )

PCA

pca_hr <- prcomp(emb_mat, scale. = TRUE, center = TRUE)

# 3) Attach first two PCs to cy_df
cy_df <- cy_df |>
  mutate(
    PCA1 = pca_hr$x[, 1],
    PCA2 = pca_hr$x[, 2]
  )

# 4) Extract explained variance for axis labels
pca_var_expl <- summary(pca_hr)$importance[2, 1:2] * 100

# 5) Plot: semantic map via PCA
ggplot(cy_df, aes(x = PCA1, y = PCA2, color = country_org, label = year)) +
  geom_point(size = 2, alpha = 0.8) +
  # geom_path(aes(group = country_org), linewidth = 0.7, alpha = 0.5) +  # uncomment for trajectories
  ggrepel::geom_text_repel(
    size         = 2.8,
    max.overlaps = 10,
    show.legend  = FALSE
  ) +
  labs(
    title    = "Semantic map of 'human rights' by country and year (PCA)",
    subtitle = "Each point = country–year ALC embedding",
    x        = sprintf("PCA 1 (%.1f%% variance)", pca_var_expl[1]),
    y        = sprintf("PCA 2 (%.1f%% variance)", pca_var_expl[2])
  ) +
  theme_bw(base_size = 12) +
  theme(
    legend.title     = element_blank(),
    panel.grid.minor = element_blank()
  )

## Warning: ggrepel: 66 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

## 1) Identify extreme country–years on PCA axes -----------------------------

extremes <- list(
  PC1_pos = cy_df %>% arrange(desc(PCA1)) %>% slice_head(n = 5),
  PC1_neg = cy_df %>% arrange(PCA1)       %>% slice_head(n = 5),
  PC2_pos = cy_df %>% arrange(desc(PCA2)) %>% slice_head(n = 5),
  PC2_neg = cy_df %>% arrange(PCA2)       %>% slice_head(n = 5)
)


## 2) Extract sentences mentioning “human rights” ----------------------------

hr_docs <- merged_df %>%
  filter(str_detect(content, regex("human right", ignore_case = TRUE)))

hr_sentences <- hr_docs %>%
  mutate(
    sentences = map(content, ~ tokenize_sentences(.x)[[1]])
  ) %>%
  select(country_org, year, sentences) %>%
  tidyr::unnest(sentences) %>%
  rename(sentence = sentences) %>%
  filter(str_detect(sentence, regex("human right", ignore_case = TRUE))) %>%
  mutate(
    sentence    = str_squish(sentence),
    cy          = paste(country_org, year, sep = "_")
  )


## 3) Tokenize HR sentences and build DFM ------------------------------------

toks_hr <- tokens(
  tolower(hr_sentences$sentence),
  remove_punct   = TRUE,
  remove_numbers = TRUE
)

# remove stopwords and very short tokens (e.g., "un", "hr")
toks_hr <- tokens_select(
  toks_hr,
  pattern   = stopwords("en"),
  selection = "remove"
)

toks_hr <- tokens_keep(
  toks_hr,
  pattern   = "^[a-z]{3,}$",  # keep only alphabetic tokens of length >= 3
  valuetype = "regex"
)

# attach docvars
docvars(toks_hr, "cy")          <- hr_sentences$cy
docvars(toks_hr, "country_org") <- hr_sentences$country_org
docvars(toks_hr, "year")        <- hr_sentences$year

dfm_hr <- dfm(toks_hr)

# sanity check
stopifnot("cy" %in% names(docvars(dfm_hr)))


## 4) Helper: keyness terms for a target set of country–years ----------------

get_keyness_terms <- function(dfm_hr, target_cy, n = 15) {
  dfm_target <- dfm_subset(dfm_hr, cy %in% target_cy)
  dfm_rest   <- dfm_subset(dfm_hr, !cy %in% target_cy)
  
  # guard: if too few docs, return empty
  if (ndoc(dfm_target) == 0L || ndoc(dfm_rest) == 0L) {
    return(character(0))
  }
  
  dfm_both <- rbind(dfm_target, dfm_rest)
  
  textstat_keyness(dfm_both, target = seq_len(ndoc(dfm_target))) %>%
    filter(chi2 > 0) %>%
    slice_max(order_by = chi2, n = n) %>%
    pull(feature)
}


## 5) Compute key terms for each PCA pole ------------------------------------

key_pc1_pos <- get_keyness_terms(dfm_hr, extremes$PC1_pos$cy, n = 10)
key_pc1_neg <- get_keyness_terms(dfm_hr, extremes$PC1_neg$cy, n = 10)
key_pc2_pos <- get_keyness_terms(dfm_hr, extremes$PC2_pos$cy, n = 10)
key_pc2_neg <- get_keyness_terms(dfm_hr, extremes$PC2_neg$cy, n = 10)


## 6) Summarise axis interpretations in a table ------------------------------

pca_axis_tbl <- tibble(
  Axis = c("PCA1 (+)", "PCA1 (−)", "PCA2 (+)", "PCA2 (−)"),
  `Representative country–years` = c(
    paste(extremes$PC1_pos$cy, collapse = ", "),
    paste(extremes$PC1_neg$cy, collapse = ", "),
    paste(extremes$PC2_pos$cy, collapse = ", "),
    paste(extremes$PC2_neg$cy, collapse = ", ")
  ),
  `Characteristic terms` = c(
    paste(key_pc1_pos, collapse = ", "),
    paste(key_pc1_neg, collapse = ", "),
    paste(key_pc2_pos, collapse = ", "),
    paste(key_pc2_neg, collapse = ", ")
  )
)


## 7) Display nicely ---------------------------------------------------------

pca_axis_tbl %>%
  kable(
    caption = paste(
      "Human-rights semantic dimensions identified from PCA of ALC embeddings.",
      "Each pole lists extreme country–years and top keyness words in their HR discourse."
    ),
    align    = c("l", "l", "l"),
    booktabs = TRUE
  ) %>%
  kable_styling(
    full_width        = FALSE,
    bootstrap_options = c("striped", "hover", "condensed"),
    font_size         = 12
  )

Human-rights semantic dimensions identified from PCA of ALC embeddings. Each pole lists extreme country–years and top keyness words in their HR discourse.
Axis	Representative country–years	Characteristic terms
PCA1 (+)	Russian Federation_1998, China_2004, United Kingdom Of Great Britain And Northern Ireland_1993, United States Of America_1993, China_2005	serbs, bosnian, dated, belgrade, forcible, letter, aggressor, approving, bamyan, clearing, commencement, draconian, engulfed, ferraro, geraldine, hannay, honourable, horrified, island, justifiable, richardson, sandjak, spared, vojvodina, wound
PCA1 (−)	Russian Federation_1994, China_1994, China_2012, China_1995, China_2015	china, reservations, min, wang, chinese, frequent, synergy, ivan, advantages, competences, conceal, dispatching, exchange, generat, generate, iwould, lawful, yong, zhao
PCA2 (+)	China_2008, China_2007, China_2002, China_2009, China_2001	reconstruction, unami, bonuca, reconciliation, hope, areas, economic, variety, iraqi, expertise
PCA2 (−)	Russian Federation_1994, United Kingdom Of Great Britain And Northern Ireland_1994, China_1995, China_1996, United Kingdom Of Great Britain And Northern Ireland_2001	notes, georgia, mary, commissioner, high, consultation, robinson, might, floor, indonesian

Embedding Regression

# one factor covariate (country_org); China is the reference level
set.seed(1001)

# now run conText regression
model1 <- conText(
  formula = "human rights" ~ country_org,
  data = toks,
  pre_trained = pre_trained,
  transform = TRUE,
  transform_matrix = transform_mat,
  jackknife = FALSE,
  confidence_level = 0.95,
  permute = TRUE,
  num_permutations = 100,
  window = 6,
  case_insensitive = TRUE,
  verbose = TRUE
)

## 7201 instances of "human rights" found.
## total observations included in regression: 7201 
## starting permutations 
## done with permutations 
##                                                          coefficient
## 1                                                 country_org_France
## 2                                   `country_org_Russian Federation`
## 3 `country_org_United Kingdom Of Great Britain And Northern Ireland`
## 4                             `country_org_United States Of America`
##   normed.estimate.orig normed.estimate.deflated normed.estimate.beta.error.null
## 1           0.02541117              0.023118532                     0.002292638
## 2           0.01090729              0.008271055                     0.002636239
## 3           0.02549858              0.023213014                     0.002285566
## 4           0.03758621              0.035376881                     0.002209330
##      n n_obs covariate_mean p.value
## 1 7201  7201      0.2351062       0
## 2 7201  7201      0.1327593       0
## 3 7201  7201      0.2539925       0
## 4 7201  7201      0.3212054       0

Here, I visualize the normed ALC coefficients for each country from the conText model. Because country_org enters as a factor with China as the reference category, each coefficient represents the shift in the meaning of “human rights” associated with that country relative to China, not relative to the overall corpus. A larger normed coefficient indicates a greater deviation from how China discusses “human rights”. In the framework of Rodríguez et al. (2023), these normed coefficients capture the magnitude of the semantic shift; they have no absolute interpretation but can be compared meaningfully across countries. Note that with 100 permutations, a reported p-value of 0 means only that none of the permuted norms exceeded the observed one, i.e. p < 0.01.
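
To illustrate what a normed coefficient measures, here is a minimal base-R sketch on simulated data. It is not the conText implementation: the group labels, the simulated shifts, and all object names are hypothetical, and the sketch leaves out the finite-sample correction behind the "deflated" estimates as well as the permutation test.

```r
set.seed(1001)
n <- 200   # number of simulated contexts
d <- 5     # hypothetical embedding dimension

# Three groups; "ref" plays the role of the reference category
group <- factor(sample(c("ref", "g1", "g2"), n, replace = TRUE),
                levels = c("ref", "g1", "g2"))

# Simulated context embeddings: g2 is shifted further from ref than g1
shift <- list(ref = rep(0, d), g1 = rep(0.2, d), g2 = rep(0.6, d))
Y <- t(sapply(as.character(group), function(g) shift[[g]] + rnorm(d, sd = 0.5)))

# Multivariate OLS: one d-dimensional coefficient vector (row of B) per term
X <- model.matrix(~ group)
B <- solve(crossprod(X), crossprod(X, Y))

# Normed coefficient: Euclidean length of each group's coefficient vector
normed <- sqrt(rowSums(B[-1, , drop = FALSE]^2))
normed
```

Each non-reference group gets a whole vector of coefficients (its average displacement from the reference group in embedding space), and the norm collapses that vector into a single magnitude. This is why the values are always positive and say nothing about the direction of the difference.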

nc_country <- model1@normed_coefficients %>%
  as_tibble() %>%
  mutate(
    coefficient = gsub("`", "", coefficient),
    country_org = sub("^country_org_", "", coefficient)
  )

ggplot(nc_country,
       aes(x = reorder(country_org, normed.estimate.deflated),
           y = normed.estimate.deflated)) +
  geom_point(size = 3, color = "steelblue") +
  coord_flip() +
  labs(
    title = "Differences in the meaning of 'human rights' by country",
    subtitle = "Normed ALC coefficients from conText model (baseline = China)",
    x = "Country",
    y = "Normed coefficient (semantic shift magnitude)"
  ) +
  theme_bw(base_size = 13)

Sentence Embedding

Here I use a pretrained sentence-embedding model to compute sentence-level embeddings for all sentences mentioning “human rights” in the UNSC speeches. I then average these sentence embeddings to the country-year level to obtain a representation of how each country discusses human rights over time. Finally, I analyze the semantic alignment of these country-year embeddings with a curated set of human-rights seed phrases, as well as with the overall global discourse on human rights.

library(reticulate)

setup_embed_env <- function(
  envname = "hr_embed_env",
  pkgs = c("sentence-transformers","numpy","pandas","tqdm"),
  model_name = "sentence-transformers/all-MiniLM-L6-v2"
) {
  if (!envname %in% virtualenv_list()) {
    virtualenv_create(envname)
  }
  py_install(packages = pkgs, envname = envname, pip = TRUE)
  use_virtualenv(envname, required = TRUE)

  st <- import("sentence_transformers", convert = TRUE)
  model <- st$SentenceTransformer(model_name)

  list(
    model = model,
    embed_dim = as.integer(model$get_sentence_embedding_dimension())
  )
}

env <- setup_embed_env()

## Using virtual environment 'hr_embed_env' ...

model <- env$model
embed_dim <- env$embed_dim

The goal of this section is to transform raw sentences mentioning “human rights” into numerical vector representations (embeddings) using a pretrained sentence-transformer model (here all-MiniLM-L6-v2).

These embeddings capture semantic meaning, so that sentences with similar ideas end up close together in high-dimensional space.

encode_texts <- function(texts,
                         batch_size = 32,
                         normalize = TRUE) {
  if (length(texts) == 0) {
    return(matrix(numeric(0), nrow = 0, ncol = embed_dim))
  }
  embs <- model$encode(
    texts,
    batch_size = as.integer(batch_size),
    show_progress_bar = FALSE,
    normalize_embeddings = normalize
  )
  embs_mat <- if (is.null(dim(embs))) {
    matrix(as.numeric(embs), nrow = 1)
  } else {
    as.matrix(embs)
  }
  embs_mat
}

l2       <- function(v) sqrt(sum(v * v))
unit_vec <- function(v) if (all(v == 0)) v else v / l2(v)
cos_sim  <- function(a, b) {
  da <- l2(a); db <- l2(b)
  if (da == 0 || db == 0) return(NA_real_)
  sum(a * b) / (da * db)
}

hr_docs <- merged_df %>%
  filter(str_detect(content, regex("human right", ignore_case = TRUE)))

hr_sentences <- hr_docs %>%
  mutate(sentences = map(content, ~ tokenize_sentences(.x)[[1]])) %>%
  select(country_org, year, sentences) %>%
  unnest(sentences) %>%
  rename(sentence = sentences) %>%
  filter(str_detect(sentence, regex("human right", ignore_case = TRUE))) %>%
  mutate(sentence = str_squish(sentence))

# if this is empty for some country/years, they'll just drop out downstream

# vector of all sentences
all_sents <- hr_sentences$sentence

all_embs_mat <- encode_texts(all_sents, batch_size = 64, normalize = TRUE)

# attach embeddings back as a list-column (one numeric vector per sentence/row)
hr_sentences$embedding <- split(all_embs_mat, row(all_embs_mat))

Here I take all the “human rights” sentences for a given country and year, and averaging their embeddings. This yields one mean vector per country–year — the semantic centroid of how that country framed human rights in that year. It smooths out individual-sentence noise, letting you track broad rhetorical or ideological trends over time.

Then I create create two types of reference embeddings to serve as comparison points.The first one is a Seed anchor (hr_anchor_seed): Computed from a small, curated set of canonical human-rights phrases (like “freedom from torture”, “rule of law”). This represents a normative, idealized concept of human rights, grounded in international legal and rights-based language

The second is a global anchor (hr_anchor_global), which follows my earlier averaging logic: it is computed as the average embedding across all “human rights” sentences in the corpus. It represents the empirical center of how “human rights” is discussed across all speakers, that is, the data-driven “mainstream” meaning.

country_year_hr <- hr_sentences %>%
  group_by(country_org, year) %>%
  summarise(
    hr_embedding = list(colMeans(do.call(rbind, embedding))),
    .groups = "drop"
  )

hr_seed_phrases <- c(
  "human rights",
  "international human rights law",
  "civil and political rights",
  "economic, social, and cultural rights",
  "protection of human rights defenders",
  #"freedom from torture",
  #"freedom of expression",
  "rule of law and accountability for violations",
  "non-discrimination and equality before the law"
)

hr_seed_embs <- encode_texts(hr_seed_phrases, batch_size = 16, normalize = TRUE)
hr_anchor_seed <- unit_vec(colMeans(hr_seed_embs))
hr_anchor_global <- unit_vec(colMeans(all_embs_mat))

country_year_hr <- country_year_hr %>%
  mutate(
    # vapply guarantees a numeric vector even for edge cases
    hr_sim_seed   = vapply(hr_embedding, function(u) cos_sim(u, hr_anchor_seed), numeric(1)),
    hr_sim_global = vapply(hr_embedding, function(u) cos_sim(u, hr_anchor_global), numeric(1))
  )
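
Each sentence embedding is unit-normalized, but a mean of unit vectors is generally shorter than unit length. Cosine similarity ignores vector length, so the country–year centroids do not need re-normalizing before comparison with an anchor. A toy sketch with made-up 3-dimensional vectors (cos_sim_toy only repeats the cosine definition to keep the example self-contained; all values are hypothetical):

```r
# Cosine similarity, same definition as cos_sim() without the zero-vector guard
cos_sim_toy <- function(a, b) sum(a * b) / (sqrt(sum(a * a)) * sqrt(sum(b * b)))

# Two hypothetical unit-length "sentence embeddings" for one country-year
s1 <- c(1, 0, 0)
s2 <- c(0, 1, 0)
centroid <- colMeans(rbind(s1, s2))   # c(0.5, 0.5, 0)

# Hypothetical unit-length anchor
anchor <- c(1, 1, 0) / sqrt(2)

# The centroid is shorter than unit length because s1 and s2 point in different directions
centroid_norm <- sqrt(sum(centroid^2))

# Rescaling the centroid leaves the cosine unchanged
sim_raw  <- cos_sim_toy(centroid, anchor)
sim_unit <- cos_sim_toy(centroid / centroid_norm, anchor)
```

A short centroid norm can also be read as a rough sign that a country–year's sentences point in dispersed directions, which may be worth inspecting alongside the similarity score.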

library(ggplot2)

ggplot(country_year_hr, aes(x = year, y = hr_sim_seed, color = country_org)) +
  geom_line(linewidth = 1) +
  geom_point(size = 1.8) +
  labs(
    title = "Semantic alignment with the human-rights seed anchor",
    subtitle = "Cosine similarity of country–year HR embeddings to curated HR anchor",
    x = "Year",
    y = "Cosine similarity",
    color = "Country"
  ) +
  theme_bw() +
  theme(legend.title = element_blank())

ggplot(country_year_hr, aes(year, hr_sim_global, color = country_org)) +
  geom_line() + geom_point() +
  labs(
    title = "Alignment with global HR discourse (alternative anchor)",
    x = "Year", y = "Cosine similarity"
  ) +
  theme_bw() +
  theme(legend.title = element_blank())

library(uwot)

emb_mat_cy <- do.call(rbind, country_year_hr$hr_embedding)
rownames(emb_mat_cy) <- paste(country_year_hr$country_org, country_year_hr$year, sep = "_")

set.seed(123)
um <- umap(
  emb_mat_cy,
  n_neighbors = 10,
  min_dist = 0.2,
  metric = "cosine"
)

country_year_hr$UMAP1 <- um[,1]
country_year_hr$UMAP2 <- um[,2]

ggplot(country_year_hr, aes(UMAP1, UMAP2, color = country_org)) +
  #geom_path(aes(group = country_org), linewidth = 1, alpha = 0.7) +
  geom_point(size = 2, alpha = 0.9) +
  ggrepel::geom_text_repel(aes(label = year), size = 3, max.overlaps = 40) +
  theme_bw() +
  labs(
    title = "UMAP projection of country–year 'human rights' embeddings",
    subtitle = "Sentence-transformer embeddings; points labelled by year",
    x = "UMAP1", y = "UMAP2",
    color = "Country"
  )

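To characterize the semantic field of “human rights” for each country, we compute the ten nearest lexical neighbors of each country’s mean human-rights embedding. The candidate vocabulary consists of the non-stopword tokens longer than three characters that occur in the human-rights sentences, each embedded with the same sentence-transformer model. These terms represent the vocabulary most semantically aligned with each country’s human-rights discourse, allowing qualitative interpretation of differences in framing across countries.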

library(quanteda)

tokens_hr <- tokens(
  tolower(hr_sentences$sentence),
  remove_punct = TRUE,
  remove_numbers = TRUE
) %>%
  tokens_select(pattern = stopwords("en"), selection = "remove")

vocab <- unique(unlist(tokens_hr))
vocab <- vocab[nchar(vocab) > 3]

word_embs <- encode_texts(vocab, batch_size = 64, normalize = TRUE)
rownames(word_embs) <- vocab

get_top_words <- function(vec, k = 10) {
  v <- unit_vec(vec)
  sims <- as.numeric(word_embs %*% v)
  names(sims) <- rownames(word_embs)
  drop_terms <- c("human","right","rights","humanrights","human-rights")
  sims <- sims[!names(sims) %in% drop_terms]
  sims <- sort(sims, decreasing = TRUE)
  head(sims, k)
}
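
Because the word vectors are unit-normalized, the nearest-neighbor lookup reduces to a single matrix-vector product followed by a sort. A toy sketch with a hypothetical four-word vocabulary in two dimensions (toy_words and query are made-up values, not drawn from the corpus):

```r
# Hypothetical unit-length word vectors, one row per word
toy_words <- rbind(
  rights       = c(1.0, 0.0),
  freedoms     = c(0.8, 0.6),
  sovereignty  = c(0.0, 1.0),
  peacekeeping = c(0.6, 0.8)
)

# Hypothetical country embedding, normalized to unit length
query      <- c(2, 1)
query_unit <- query / sqrt(sum(query^2))

# Dot products with unit vectors are cosine similarities
sims <- drop(toy_words %*% query_unit)

# Drop the target term itself, then keep the two closest words
sims <- sims[!names(sims) %in% "rights"]
top2 <- names(head(sort(sims, decreasing = TRUE), 2))
```

In this toy setup the query leans toward the first dimension, so the word vector closest to that axis ranks first once the target term is removed.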

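nn_tbl <- country_year_hr %>%
  group_by(country_org) %>%
  summarise(
    # average all country-year centroids so the neighbors describe the
    # country's mean embedding rather than only its first year
    top_words = list(names(get_top_words(colMeans(do.call(rbind, hr_embedding)), k = 10))),
    .groups = "drop"
  )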

library(kableExtra)

nn_tbl %>%
  mutate(top_words = sapply(top_words, function(w) paste(w, collapse = ", "))) %>%
  kable(
    caption = "Top 10 nearest terms to each country's human-rights embedding (sentence-embedding space)"
  ) %>%
  kable_styling(full_width = FALSE, position = "center", font_size = 11)

Top 10 nearest terms to each country’s human-rights embedding (sentence-embedding space)
country_org	top_words
China	rights-compliant, rights-based, rights-respecting, rights-centred, non-peacekeeping, rights-focused, protection-of-civilians, peacekeeping, sovereignty, nations-mandated
France	peacekeeping, non-peacekeeping, protection-of-civilians, humanitarian, detainees, humanitarians, humanitarianism, rights-compliant, anti-genocide, civilians
Russian Federation	protection-of-civilians, peacekeeping, rights-respecting, non-peacekeeping, protection-of-civilian, inhumanity, rights-compliant, detainees, rights-centred, humanitarian
United Kingdom Of Great Britain And Northern Ireland	peacekeeping, protection-of-civilians, non-peacekeeping, rights-respecting, rights-compliant, peace-keeping, protection-of-civilian, detainees, rights-centred, freedoms
United States Of America	genocide, humanitarianism, anti-genocide, atrocities, detainees, humanitarian, peacekeeping, dictatorship, non-peacekeeping, humanitarians

Embedding procedures

We employed two complementary embedding approaches to analyze the semantic framing of “human rights” across United Nations Security Council speeches.

First, we applied à la carte (ALC) word embeddings following Rodríguez et al. (2023), which combine pretrained static embeddings (e.g., GloVe) with corpus-specific transformation matrices to derive context-adjusted word vectors. ALC embeddings capture micro-level shifts in the meaning of specific terms—in this case, how the concept “human rights” changes across countries and over time.
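
The ALC step can be sketched in a few lines of base R: each instance embedding is the average of the pretrained vectors of its context words, multiplied by the transformation matrix, and the target embedding is the average over instances. This is a toy illustration with made-up numbers, not the code used for the ALC analysis in this document; toy_pretrained, toy_transform, and alc_instance are hypothetical names, and the diagonal matrix merely stands in for a learned transformation such as transform_mat.

```r
# Toy pretrained embeddings: 5 words x 3 dimensions (hypothetical values)
set.seed(1)
toy_pretrained <- matrix(
  rnorm(15), nrow = 5, ncol = 3,
  dimnames = list(c("protect", "civilians", "sovereignty", "law", "violations"), NULL)
)

# Stand-in for a learned transformation matrix (3 x 3)
toy_transform <- diag(c(2, 1, 0.5))

# Context words around two hypothetical instances of the target phrase
contexts <- list(
  c("protect", "civilians", "law"),
  c("sovereignty", "law", "unknownword")  # out-of-vocabulary words are skipped
)

# ALC embedding of one instance: transform %*% mean(context word vectors)
alc_instance <- function(context, pretrained, transform) {
  known <- context[context %in% rownames(pretrained)]
  if (length(known) == 0) return(rep(NA_real_, ncol(pretrained)))
  avg <- colMeans(pretrained[known, , drop = FALSE])
  as.numeric(transform %*% avg)
}

# One row per instance, then a column-wise mean for the target embedding
instance_embs <- t(sapply(contexts, alc_instance,
                          pretrained = toy_pretrained, transform = toy_transform))
alc_target <- colMeans(instance_embs, na.rm = TRUE)
```

Grouping the rows of instance_embs by country before averaging would give country-level vectors in the same way.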

Second, we used sentence embeddings generated with the pretrained transformer model all-MiniLM-L6-v2 (via the sentence-transformers library). Sentence embeddings capture the contextual meaning of full sentences, allowing us to summarize each country–year’s discourse on human rights as a high-dimensional vector and to measure its alignment with two reference anchors, a curated seed anchor and the global corpus centroid, using cosine similarity. This method provides a macro-level view of cross-national and temporal variation in human-rights discourse.

UNSC-semantic check

Winnie

2025-12-03