Summary

This study analyzes United Nations General Debate (UNGD) speeches from 1946 to 2024 to identify latent thematic and geopolitical structures without predefined labels. Using a classic information‑retrieval pipeline—TF‑IDF vectorization, truncated singular value decomposition (LSA) for dimension reduction, and spherical k‑means clustering—we recover coherent clusters that align with major shifts in multilateral discourse. Cluster labels are derived with a class‑based TF‑IDF (c‑TF‑IDF) procedure, and robustness is supported by stability diagnostics across random initializations. The temporal distribution of clusters highlights clear transitions in agenda setting (e.g., post‑war/decolonization, Cold War realignments, development eras, sustainability and health shocks), while country‑level summaries show geographic concentration within clusters. The results demonstrate that simple, transparent linear methods can yield interpretable structure in large political text corpora.

Introduction

The UN General Debate is frequently described as a barometer of world opinion: heads of state and government outline priorities, narrate crises, and articulate international norms. This project asks whether unsupervised methods can reveal persistent thematic structures and historical transitions across nearly eight decades of debate.

Three questions guide the analysis. First, does the corpus exhibit clusterable structure rather than random variation? Second, which dimension‑reduction strategy and rank yield stable, interpretable clusters for high‑dimensional textual data? Third, how do the resulting clusters map onto recognizable periods and issues in UN discourse, and how do they vary across years and countries?

Data

Source and scope

Transcripts of UN General Debate Speeches were obtained from Harvard Dataverse.

Files follow the convention ISO3_SESSION_YEAR.txt (e.g., USA_75_2020.txt).

Ingestion and structure

A minimal loader extracts durable metadata—iso3, session, and year—from filenames and reads the full text into a single tibble (speeches_df). The ingestion process is audit‑friendly by design: it preserves raw text alongside all derived fields and avoids irreversible transformations at load time.

suppressPackageStartupMessages({
  library(readr)   # read_file()
  library(dplyr)   # mutate(), select(), count(), distinct()
  library(stringr) # str_*
  library(Matrix)
  library(tidyr)
  library(tibble)
  library(proxy)
  library(uwot)
  library(quanteda)# tidy, fast text ops (tokenize, stopwords, stemming)
  library(textclean)      # quick cleaning helpers for OCR artifacts
  library(purrr)   # map_chr(), map_dfr()
  library(cld3)        # language detection (fast, minimal)
  library(irlba) # irlba()
  library(hopkins)  # hopkins()
  library(mclust)
})
# ---- Configuration ----
ROOT_DIR <- "C:/Users/aaa/OneDrive/Dokumenty/Courses/Unsupervised learning/Projects/data/TXT"
OCR_TRANSLATION_YEAR_MIN <- 2024           # conservative placeholder
OUTPUT_RDS <- "speeches_1946_2024.rds"

# ---- Helpers ----
set.seed(42)
# Find all .txt files recursively under a root directory
find_speech_files <- function(root) {
  stopifnot(is.character(root), length(root) == 1L)

  files <- list.files(
    path = root,
    pattern = "\\.txt$",
    full.names = TRUE,
    recursive = TRUE
  )

  if (length(files) == 0L) {
    stop(
      sprintf("No .txt files found under '%s'. Check the path or dataset layout.", root)
    )
  }
  files
}

# Parse filename into iso3, session, year (expected: ISO3_SESSION_YEAR.txt)
parse_filename_meta <- function(path) {
  fn   <- basename(path)
  core <- str_remove(fn, "\\.txt$")
  parts <- str_split(core, "_", simplify = TRUE)

  if (ncol(parts) != 3L) {
    warning(sprintf("Skipping file with unexpected name pattern: %s", fn))
    return(tibble::tibble(
      file_path = path, iso3 = NA_character_, session = NA_integer_, year = NA_integer_
    ))
  }

  tibble::tibble(
    file_path = path,
    iso3      = parts[, 1],
    session   = suppressWarnings(as.integer(parts[, 2])),
    year      = suppressWarnings(as.integer(parts[, 3]))
  )
}

# Read a speech as a single UTF‑8 string and perform a basic whitespace squish
read_speech_text <- function(path) {
  text_raw <- readr::read_file(path)
  stringr::str_squish(text_raw)
}

# ---- Pipeline Steps ----

# 1) Build metadata table (one row per file)
build_meta <- function(file_paths) {
  purrr::map_dfr(file_paths, parse_filename_meta)
}

# 2) Attach text column (read each file)
attach_text <- function(meta_tbl) {
  meta_tbl %>%
    mutate(text = purrr::map_chr(file_path, read_speech_text))
}

# 3) Post-process flags and quick QC
finalize_df <- function(df, ocr_year_min = OCR_TRANSLATION_YEAR_MIN) {
  df %>%
    # Conservative indicator for possible OCR/translation in the latest session(s)
    mutate(suspected_ocr_or_translated = !is.na(year) & year >= ocr_year_min) %>%
    # Drop exact duplicate rows (rare, but safe)
    distinct() %>%
    # Flag empty texts
    mutate(is_empty = nchar(text, type = "chars", allowNA = FALSE) == 0L)
}

# 4) Quick summary (printed to console); objects are returned invisibly if needed
quick_diagnostics <- function(df) {
  message("---- Quick diagnostics ----")
  suppressMessages({
    print(df %>% count(year, name = "n") %>% arrange(year), n = 10)
    print(df %>% count(session, name = "n") %>% arrange(session), n = 10)
  })
  message("Summary(speeches_df):")
  print(summary(df))
  invisible(df)
}

# ---- Main ----
run_loader <- function(root_dir = ROOT_DIR, out_rds = OUTPUT_RDS) {
  message(sprintf("Scanning for .txt files under: %s", root_dir))
  speech_files <- find_speech_files(root_dir)

  message(sprintf("Found %d files. Parsing metadata ...", length(speech_files)))
  meta_df <- build_meta(speech_files)

  # (Optional) sanity checks on meta
  n_bad <- sum(is.na(meta_df$iso3) | is.na(meta_df$session) | is.na(meta_df$year))
  if (n_bad > 0L) {
    warning(sprintf("There are %d files with missing (iso3/session/year). See warnings above.", n_bad))
  }

  message("Reading text content (this may take a moment) ...")
  speeches_df <- attach_text(meta_df)

  message("Finalizing flags and quality checks ...")
  speeches_df <- finalize_df(speeches_df, ocr_year_min = OCR_TRANSLATION_YEAR_MIN)

  quick_diagnostics(speeches_df)

  message(sprintf("Saving RDS to: %s", out_rds))
  saveRDS(speeches_df, file = out_rds)

  message("Done.")
  speeches_df
}

# ---- Execute ----
# Run the loader; returns the tibble invisibly while also saving the RDS
speeches_df <- run_loader()
## Scanning for .txt files under: C:/Users/aaa/OneDrive/Dokumenty/Courses/Unsupervised learning/Projects/data/TXT
## Found 10952 files. Parsing metadata ...
## Reading text content (this may take a moment) ...
## Finalizing flags and quality checks ...
## ---- Quick diagnostics ----
## # A tibble: 79 × 2
##     year     n
##    <int> <int>
##  1  1946    39
##  2  1947    39
##  3  1948    39
##  4  1949    35
##  5  1950    44
##  6  1951    51
##  7  1952    43
##  8  1953    44
##  9  1954    42
## 10  1955    45
## # ℹ 69 more rows
## # A tibble: 79 × 2
##    session     n
##      <int> <int>
##  1       1    39
##  2       2    39
##  3       3    39
##  4       4    35
##  5       5    44
##  6       6    51
##  7       7    43
##  8       8    44
##  9       9    42
## 10      10    45
## # ℹ 69 more rows
## Summary(speeches_df):
##   file_path             iso3              session          year     
##  Length:10952       Length:10952       Min.   : 1.0   Min.   :1946  
##  Class :character   Class :character   1st Qu.:33.0   1st Qu.:1978  
##  Mode  :character   Mode  :character   Median :51.0   Median :1996  
##                                        Mean   :48.3   Mean   :1993  
##                                        3rd Qu.:65.0   3rd Qu.:2010  
##                                        Max.   :79.0   Max.   :2024  
##      text           suspected_ocr_or_translated  is_empty      
##  Length:10952       Mode :logical               Mode :logical  
##  Class :character   FALSE:10760                 FALSE:10952    
##  Mode  :character   TRUE :192                                  
##                                                                
##                                                                
## 
## Saving RDS to: speeches_1946_2024.rds
## Done.

Text Preprocessing (Light, Explicit, Reversible)

We create a working text column text_clean from the raw text and apply light, reversible operations: normalization of non‑ASCII characters; removal of URLs and layout artifacts; lowercasing; punctuation and symbol stripping; and removal of standalone numerals. Stemming is deferred to the tokenization step in Feature Construction. For 2024 materials and any suspected OCR/translation cases, we apply conservative whitespace normalization. A lightweight, optional language‑detection step flags non‑English content for awareness without excluding documents (a sketch follows the preprocessing code below).

speeches_df <- readRDS("speeches_1946_2024.rds")



speeches_df <- speeches_df %>%
  mutate(text_clean = text)  # start from raw, keep original intact

speeches_df <- speeches_df %>%
  mutate(
    text_clean = textclean::replace_non_ascii(text_clean), # normalize odd chars
    text_clean = str_replace_all(text_clean, "\\s+", " ")  # collapse whitespace
  )


speeches_df <- speeches_df %>%
  mutate(
    text_clean = str_remove_all(text_clean, "https?://\\S+"),
    text_clean = str_remove_all(text_clean, "www\\.[A-Za-z0-9\\-]+\\.[A-Za-z]{2,}\\S*"),
    text_clean = str_replace_all(text_clean, "—|–", "-"),
    text_clean = str_replace_all(text_clean, "·|•", " "),
    text_clean = str_squish(text_clean)
  )

fix_ocr <- function(x){
  x %>%
    textclean::replace_white()            # normalizes excess spaces
}

speeches_df <- speeches_df %>%
  mutate(
    text_clean = ifelse(suspected_ocr_or_translated, fix_ocr(text_clean), text_clean)
  )


speeches_df <- speeches_df %>%
  mutate(text_clean = str_to_lower(text_clean))

speeches_df <- speeches_df %>%
  mutate(text_clean = str_replace_all(text_clean, "[^\\p{L}\\p{N}\\s’'-]", " "))


# remove standalone numbers (for TF-IDF path)
speeches_df <- speeches_df %>%
  mutate(text_clean = str_replace_all(text_clean, "\\b\\d+\\b", " "))
saveRDS(speeches_df, "speeches_preprocessed.rds")
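
The optional language‑detection flag mentioned above is not part of the saved object. A minimal sketch of how it could be added, assuming the cld3 package is available; detecting on a prefix of each document is an assumed speed/accuracy compromise:

# Optional: flag (not drop) documents whose detected language is not English.
# Detection runs on the first ~2,000 characters to keep it fast (an assumption).
speeches_df <- speeches_df %>%
  mutate(
    lang_detected  = cld3::detect_language(str_sub(text_clean, 1, 2000)),
    is_non_english = !is.na(lang_detected) & lang_detected != "en"
  )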

Feature Construction (TF‑IDF)

We tokenize with quanteda, remove English stopwords, and apply stemming. A sparse document‑feature matrix (DFM) is trimmed with light frequency thresholds and transformed to TF‑IDF (log‑TF, inverse‑DF). Row vectors are L2‑normalized so that cosine similarity is equivalent to a dot product, which improves high‑dimensional contrast.

speeches_df <- readRDS("speeches_preprocessed.rds")
# Build TF‑IDF (sparse DTM)
corp <- quanteda::corpus(speeches_df, text_field = "text_clean")
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE)
toks <- tokens_remove(toks, pattern = stopwords("en"))
toks <- tokens_wordstem(toks, language = "english")

dfm_mat <- dfm(toks)
dfm_mat <- dfm_trim(dfm_mat, min_termfreq = 5, min_docfreq = 3)
tfidf   <- dfm_tfidf(dfm_mat, scheme_tf = "logcount", scheme_df = "inverse")

# L2-normalize document vectors so cosine ≈ dot product
row_norms <- sqrt(rowSums(tfidf^2))
tfidf_l2  <- Diagonal(x = 1 / pmax(row_norms, .Machine$double.eps)) %*% tfidf

saveRDS(tfidf_l2, "dfm_tfidf.rds")

Pre‑clustering Diagnostics (Clusterability / Tendency)

We assess cluster tendency before fitting any clustering algorithm. A truncated SVD to 50 components (computed with irlba) stabilizes high‑dimensional distances, and the Hopkins statistic is then computed on this low‑rank projection. Values near one indicate clusterable structure, while values near one half suggest spatial randomness.

# Light projection to stabilize high-D distances (e.g., 50 components)
svd_50 <- irlba(tfidf_l2, nv = 50)
X50 <- svd_50$u %*% diag(svd_50$d)


H <- hopkins(X50, m = nrow(X50) %/% 10)  # ≈10% sample
H
## [1] 1
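
As a reference point, the same statistic on a size‑matched uniform random cloud should hover near 0.5. A minimal sketch of that baseline (illustrative only, not part of the saved pipeline):

# Baseline: Hopkins on uniform random data spanning the same ranges should be close to 0.5.
set.seed(42)
X_unif <- apply(X50, 2, function(col) runif(length(col), min(col), max(col)))
hopkins(X_unif, m = nrow(X_unif) %/% 10)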

Dimension Reduction

Choice of method

We employ truncated SVD (LSA) on TF‑IDF because it is fast on sparse matrices, denoises lexical variation, and provides a linear subspace in which prototype‑based clustering behaves well. In text applications, LSA is a robust baseline that preserves interpretability while reducing dimensionality.

How Many Components? (Rank Selection)

To avoid ad‑hoc settings, we evaluate a grid of candidate ranks (e.g., 50, 100, 150, 200) and select k via clustering stability. For each k, we project documents to the LSA space, L2‑normalize rows (spherical geometry), fit k‑means multiple times from independent initializations, and compute the mean Adjusted Rand Index (ARI) across all run pairs. The final k is chosen as the smallest rank at or near the stability peak (the “sweet spot” where adding components does not materially increase stability).

library(mclust)  # adjustedRandIndex
# Choose ranks to test; you can expand this grid.
k_grid <- c(50, 100, 150, 200)
# Helper: row L2-normalization (spherical / cosine geometry)
normalize_rows <- function(M) {
  rs <- sqrt(rowSums(M^2))
  rs[rs == 0] <- 1
  M / rs
}

# Helper: spherical k-means stability across runs (fixed K)
km_stability <- function(Z, K = 8, runs = 10, nstart_each = 10, itermax = 300, algorithm = "Lloyd") {
  labs <- replicate(runs, {
    km <- kmeans(Z, centers = K, nstart = nstart_each, iter.max = itermax, algorithm = algorithm)
    km$cluster
  })
  comb <- combn(ncol(labs), 2)
  mean(apply(comb, 2, function(p) mclust::adjustedRandIndex(labs[, p[1]], labs[, p[2]])))
}

# Helper: choose K by stability (grid)
pick_K_by_stability <- function(Zs, K_grid = 6:12) {
  stab <- sapply(K_grid, function(K) km_stability(Zs, K = K, runs = 10, nstart_each = 10, itermax = 300))
  tibble(K = K_grid, mean_ARI = stab) |> arrange(desc(mean_ARI))
}

# Helper: cosine-silhouette (sanity check, not primary for text)
cosine_silhouette <- function(Zs, clustering) {
  d_cos <- as.dist(proxy::dist(Zs, method = "cosine"))
  sil   <- cluster::silhouette(clustering, d_cos)
  mean(sil[, 3])
}
svd_list <- lapply(k_grid, function(k) {
  message(sprintf("Computing randomized SVD (k=%d) ...", k))
  set.seed(42)
  irlba::irlba(tfidf_l2, nv = k)
})
## Computing randomized SVD (k=50) ...
## Computing randomized SVD (k=100) ...
## Computing randomized SVD (k=150) ...
## Computing randomized SVD (k=200) ...
names(svd_list) <- paste0("k", k_grid)

# 4) Stability over k (spherical space) — pick the sweet spot
stab_over_k <- purrr::map2_dbl(svd_list, k_grid, function(svd_k, k) {
  Z  <- svd_k$u %*% diag(svd_k$d)
  Zs <- normalize_rows(Z)
  km_stability(Zs, K = 8, runs = 10, nstart_each = 10, itermax = 300)
})
stab_tbl <- tibble(k = k_grid, mean_ARI = stab_over_k) |> arrange(desc(mean_ARI))
print(stab_tbl)
## # A tibble: 4 × 2
##       k mean_ARI
##   <dbl>    <dbl>
## 1   100    0.908
## 2   150    0.847
## 3    50    0.788
## 4   200    0.773
k_star <- stab_tbl$k[1]
svd_k  <- svd_list[[paste0("k", k_star)]]
Z      <- svd_k$u %*% diag(svd_k$d)
Zs     <- normalize_rows(Z)


message(sprintf("Selected DR rank k = %d (highest mean ARI)", k_star))
## Selected DR rank k = 100 (highest mean ARI)
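
A complementary, purely descriptive check (it plays no role in the selection above) is the share of the TF‑IDF matrix's total squared Frobenius norm retained by the chosen rank; a minimal sketch:

# Share of total squared norm captured by the top k_star singular values.
var_share <- sum(svd_k$d^2) / sum(tfidf_l2^2)
message(sprintf("Top %d components retain ~%.1f%% of the total squared norm.", k_star, 100 * var_share))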

Clustering Strategy

Algorithm

Clustering is performed with k‑means in the LSA subspace under a spherical (cosine) geometry. Row‑normalizing document vectors reduces the influence of length and approximates spherical k‑means, which is better aligned with cosine similarity for high‑dimensional text.
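
The geometric link can be verified numerically: for L2‑normalized rows, squared Euclidean distance equals 2(1 − cosine similarity), so Lloyd’s k‑means on the normalized scores minimizes an objective that is monotone in cosine dissimilarity. A minimal check using the normalize_rows helper defined above:

# For unit-norm rows: ||a - b||^2 = 2 * (1 - cos(a, b)).
set.seed(1)
pair    <- normalize_rows(matrix(rnorm(20), nrow = 2))
eucl_sq <- sum((pair[1, ] - pair[2, ])^2)
cos_sim <- sum(pair[1, ] * pair[2, ])
all.equal(eucl_sq, 2 * (1 - cos_sim))  # TRUE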

Choosing K (number of clusters):

We evaluate K on a small grid (e.g., 6–12) using the same stability criterion (mean ARI across random initializations). As a sanity check, we also report a cosine‑based silhouette; because silhouette values tend to be small in sparse, high‑dimensional text settings, it is not used as the primary selection metric.

K_candidates <- 6:12
K_ranked     <- pick_K_by_stability(Zs, K_grid = K_candidates)
print(K_ranked)
## # A tibble: 7 × 2
##       K mean_ARI
##   <int>    <dbl>
## 1     6    0.979
## 2     7    0.837
## 3     8    0.828
## 4     9    0.765
## 5    10    0.681
## 6    11    0.624
## 7    12    0.608
K_star <- K_ranked$K[1]
km_fit <- kmeans(Zs, centers = K_star, nstart = 20, iter.max = 300)

sil_cos <- cosine_silhouette(Zs, km_fit$cluster)
message(sprintf("Chosen K = %d | cosine silhouette ≈ %.3f", K_star, sil_cos))
## Chosen K = 6 | cosine silhouette ≈ 0.099

Cluster Labeling (c‑TF‑IDF)

To obtain human‑readable themes, we use a class‑based TF‑IDF (c‑TF‑IDF) procedure: documents are concatenated within each cluster, and a TF‑IDF model across clusters identifies terms that are discriminative for that cluster. We report top terms as labels, which facilitates interpretation without training a probabilistic topic model.

speeches_df$cluster_km <- km_fit$cluster

cluster_sizes <- as.data.frame(table(km_fit$cluster)) |>
  dplyr::rename(cluster_km = Var1, n_docs = Freq) |>
  dplyr::arrange(desc(n_docs))

print(cluster_sizes)
##   cluster_km n_docs
## 1          2   2332
## 2          3   2112
## 3          4   2103
## 4          1   1613
## 5          6   1503
## 6          5   1289
# Save core artifacts
saveRDS(list(
  k_star = k_star,
  K_star = K_star,
  svd = svd_k,
  scores = Z,            # LSA scores
  scores_spherical = Zs, # normalized
  kmeans = km_fit,
  labels = speeches_df$cluster_km,
  cluster_sizes = cluster_sizes
), file = sprintf("clustering_artifacts_k%d_K%d.rds", k_star, K_star))

# Collapse all documents in a cluster into one "class document",
# then compute TF‑IDF across those classes to get top terms per cluster.
message("Computing c‑TF‑IDF labels ...")
## Computing c‑TF‑IDF labels ...
class_docs <- speeches_df |>
  dplyr::mutate(cluster_km = paste0("C", cluster_km)) |>
  dplyr::group_by(cluster_km) |>
  dplyr::summarise(text = paste(text_clean, collapse = " "), .groups = "drop")

corp_c <- quanteda::corpus(class_docs, text_field = "text")
toks_c <- tokens(corp_c, remove_punct = TRUE, remove_symbols = TRUE)
toks_c <- tokens_remove(toks_c, stopwords("en"))
toks_c <- tokens_wordstem(toks_c, language = "english")
dfm_c  <- dfm(toks_c)
# Classical TF‑IDF on class documents approximates c‑TF‑IDF weighting.
tfidf_c <- dfm_tfidf(dfm_c, scheme_tf = "count", scheme_df = "inverse")

top_terms_per_cluster <- function(tfidf_mat, top_n = 12) {
  terms <- colnames(tfidf_mat)
  res <- lapply(1:nrow(tfidf_mat), function(i){
    s <- as.numeric(tfidf_mat[i, ])
    ord <- order(s, decreasing = TRUE)[1:min(top_n, sum(s > 0))]
    tibble(
      cluster_km = rownames(tfidf_mat)[i],
      rank = seq_along(ord),
      term = terms[ord],
      weight = s[ord]
    )
  })
  dplyr::bind_rows(res)
}

c_labels <- top_terms_per_cluster(tfidf_c, top_n = 12)
readr::write_csv(c_labels, sprintf("cluster_labels_cTFIDF_k%d_K%d.csv", k_star, K_star))
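
For reference, the original c‑TF‑IDF weighting (popularized by BERTopic) replaces the document‑frequency term with log(1 + A / f_t), where A is the average number of tokens per class and f_t is the total corpus frequency of term t. A minimal sketch of that variant on the same class DFM (illustrative only; the labels above use the classical TF‑IDF approximation):

# BERTopic-style c-TF-IDF on the class documents (one dense row per cluster, so a small matrix).
ctfidf_weights <- function(dfm_classes) {
  tf  <- as.matrix(dfm_classes)      # term counts per class document
  A   <- mean(rowSums(tf))           # average tokens per class
  idf <- log(1 + A / colSums(tf))    # class-based "IDF" over the whole corpus
  sweep(tf, 2, idf, `*`)             # weight = tf_{t,c} * log(1 + A / f_t)
}
ctfidf_mat <- ctfidf_weights(dfm_c)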

Results

Cluster themes

library(knitr)
kable(top_terms_per_cluster(tfidf_c, top_n = 3))
cluster_km   rank   term          weight
text1           1   bosnia         258.32588
text1           2   herzegovina    200.92013
text1           3   croatia         93.32837
text2           1   kampuchea      570.71177
text2           2   kampuchean     435.61171
text2           3   connexion      213.12924
text3           1   mdgs           368.76174
text3           2   hiv            194.93302
text3           3   nepad          142.65926
text4           1   viet-nam      1937.11229
text4           2   connexion      647.81655
text4           3   viet-names     408.41579
text5           1   covid          651.72994
text5           2   sdgs           369.66483
text5           3   pandem         231.60514
text6           1   azerbaijan     113.05059
text6           2   bosnia          69.55605
text6           3   serbia          63.74504

The c‑TF‑IDF labels indicate coherent thematic groupings. Illustrative examples include:

  • Cluster A: terms such as viet‑nam, khrushchev, and vyshinski, consistent with Cold War rhetoric and the Vietnam era.
  • Cluster B: terms such as kampuchean, suggesting debates around conflict and alignment dynamics of the mid‑to‑late Cold War.
  • Cluster C: terms such as bosnia, herzegovina, boutros‑ghali, consistent with post‑Cold War conflicts and UN peace operations in the 1990s.
  • Cluster D: terms such as mdgs, hiv, nepad, indicative of the development‑governance agenda prominent in the early 2000s.
  • Cluster E: terms such as covid, sdgs, reflecting the sustainability focus of the mid‑2010s and the public‑health shock beginning in 2020.
  • Cluster F: a more heterogeneous mix that captures cross‑cutting themes (e.g., terrorism) discussed across multiple sessions.

These labels are descriptive rather than prescriptive; they help anchor interpretation while preserving a fully unsupervised pipeline.

Temporal dynamics

Plotting the number of documents per cluster by year reveals distinct eras where a single thematic cluster dominates, as well as hand‑offs between clusters. The timing of inflection points aligns qualitatively with widely recognized transitions in international politics (e.g., post‑colonial expansions, oil shocks and North–South political economy debates, end of the Cold War, the MDG/SDG agenda, COVID‑19, and renewed security emphasis in the early 2020s). Vertical reference lines can be added to the figure to annotate these periods.

library(ggplot2)
speeches_df_plot <- speeches_df |>
  mutate(
    year = suppressWarnings(as.integer(year)),
    iso3 = toupper(as.character(iso3)),
    cluster_km = factor(cluster_km)
  )

topN_years <- 5

per_cluster_year_counts <- speeches_df_plot |>
  filter(!is.na(year)) |>
  count(cluster_km, year, name = "n_docs") |>
  arrange(cluster_km, year)

cluster_year_summary <- per_cluster_year_counts |>
  group_by(cluster_km) |>
  summarise(
    n_docs_total   = sum(n_docs),
    first_year     = min(year, na.rm = TRUE),
    last_year      = max(year, na.rm = TRUE),
    timespan_years = last_year - first_year + 1L,
    n_years_covered= n_distinct(year),
    .groups = "drop"
  ) |>
  arrange(desc(n_docs_total))
ggplot(per_cluster_year_counts, aes(year, n_docs, color = cluster_km)) +
  geom_line(linewidth = 0.6, alpha = 0.9) +
  geom_point(size = 0.9, alpha = 0.8) +
  labs(
    title    = sprintf("UN Speeches — Documents per Year by Cluster (k=%d, K=%d)", k_star, K_star),
    x        = "Year", y = "Documents",
    color    = "Cluster"
  ) +
  theme_minimal(base_size = 12)

ggsave(sprintf("plot_docs_per_year_by_cluster_k%d_K%d.png", k_star, K_star),
       width = 9, height = 5.5, dpi = 150)
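
The vertical reference lines mentioned above can be layered onto the same figure. A minimal sketch; the annotated years below are illustrative era markers chosen by hand, not outputs of the model:

# Illustrative era markers (assumed, not derived from the clustering).
era_breaks <- c(1960, 1973, 1989, 2000, 2015, 2020)

ggplot(per_cluster_year_counts, aes(year, n_docs, color = cluster_km)) +
  geom_line(linewidth = 0.6, alpha = 0.9) +
  geom_vline(xintercept = era_breaks, linetype = "dashed", color = "grey50", alpha = 0.7) +
  labs(
    title = sprintf("UN Speeches — Documents per Year by Cluster (k=%d, K=%d)", k_star, K_star),
    x = "Year", y = "Documents", color = "Cluster"
  ) +
  theme_minimal(base_size = 12)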

Country composition

Per‑cluster country summaries (top ISO‑3 codes and counts) show that some themes are geographically concentrated (e.g., regional conflicts, coalition statements), while others are broadly shared (e.g., development and sustainability). These distributions help distinguish clusters driven by regional crises from those reflecting global agenda items.

topN_iso <- 5

per_cluster_iso_counts <- speeches_df_plot |>
  filter(!is.na(iso3) & nzchar(iso3)) |>
  count(cluster_km, iso3, name = "n_docs") |>
  arrange(cluster_km, desc(n_docs), iso3)

cluster_iso_summary <- per_cluster_iso_counts |>
  group_by(cluster_km) |>
  summarise(
    n_docs_total   = sum(n_docs),
    n_iso3         = n_distinct(iso3),
    .groups = "drop"
  ) |>
  arrange(desc(n_docs_total))
# (E2) Top ISO3 per cluster (faceted bar chart; shows the top topN_iso countries per cluster)
top_iso_plot_df <- per_cluster_iso_counts |>
  group_by(cluster_km) |>
  slice_max(order_by = n_docs, n = topN_iso, with_ties = TRUE) |>
  ungroup() |>
  mutate(iso3 = forcats::fct_reorder(iso3, n_docs))

ggplot(top_iso_plot_df, aes(x = n_docs, y = iso3, fill = cluster_km)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ cluster_km, scales = "free_y") +
  labs(
    title = sprintf("UN Speeches — Top %d ISO3 per Cluster (k=%d, K=%d)", topN_iso, k_star, K_star),
    x = "Documents", y = "ISO3"
  ) +
  theme_minimal(base_size = 12)

ggsave(sprintf("plot_top_iso_by_cluster_k%d_K%d.png", k_star, K_star),
       width = 9, height = 7, dpi = 150)

Discussion

The analysis shows that a transparent, linear pipeline—TF‑IDF → LSA → spherical k‑means—can recover meaningful structure in a long‑horizon political speech corpus. The resulting clusters correspond intuitively to known periods of UN discourse, and their temporal trajectories underline how the agenda shifts from security and sovereignty concerns to development and sustainability, and then to health and renewed security emphasis. Because labeling is derived by c‑TF‑IDF rather than exogenous dictionaries, interpretability follows from the data. Two aspects are noteworthy. First, the presence of a diffuse cluster reflects the reality that some topics pervade multiple eras and regions (e.g., terrorism, sanctions, procedural reform). Second, the stability‑based choice of rank and K avoids overfitting and documents where the clustering is robust versus where it becomes sensitive to modeling choices.