This study analyzes United Nations General Debate (UNGD) speeches from 1946 to 2024 to identify latent thematic and geopolitical structures without predefined labels. Using a classic information‑retrieval pipeline—TF‑IDF vectorization, truncated singular value decomposition (LSA) for dimension reduction, and spherical k‑means clustering—we recover coherent clusters that align with major shifts in multilateral discourse. Cluster labels are derived with a class‑based TF‑IDF (c‑TF‑IDF) procedure, and robustness is supported by stability diagnostics across random initializations. The temporal distribution of clusters highlights clear transitions in agenda setting (e.g., post‑war/decolonization, Cold War realignments, development eras, sustainability and health shocks), while country‑level summaries show geographic concentration within clusters. The results demonstrate that simple, transparent linear methods can yield interpretable structure in large political text corpora.
The UN General Debate is frequently described as a barometer of world opinion: heads of state and government outline priorities, narrate crises, and articulate international norms. This project asks whether unsupervised methods can reveal persistent thematic structures and historical transitions across nearly eight decades of debate.
Three questions guide the analysis. First, does the corpus exhibit clusterable structure rather than random variation? Second, which dimension‑reduction strategy and rank yield stable, interpretable clusters for high‑dimensional textual data? Third, how do the resulting clusters map onto recognizable periods and issues in UN discourse, and how do they vary across years and countries?
Transcripts of UN General Debate speeches were obtained from the Harvard Dataverse.
Files follow the convention ISO3_SESSION_YEAR.txt (e.g., USA_75_2020.txt).
A minimal loader extracts durable metadata—iso3, session, and year—from filenames and reads the full text into a single tibble (speeches_df). The ingestion process is audit‑friendly by design: it preserves raw text alongside all derived fields and avoids irreversible transformations at load time.
suppressPackageStartupMessages({
  library(readr)     # read_file()
  library(dplyr)     # mutate(), select(), count(), distinct()
  library(stringr)   # str_*
  library(Matrix)    # Diagonal(), sparse matrices
  library(tidyr)     # data reshaping
  library(tibble)    # tibble()
  library(proxy)     # dist(..., method = "cosine")
  library(uwot)      # umap()
  library(quanteda)  # tidy, fast text ops (tokenize, stopwords, stemming)
  library(textclean) # quick cleaning helpers for OCR artifacts
  library(purrr)     # map_chr(), map_dfr()
  library(cld3)      # language detection (fast, minimal)
  library(irlba)     # irlba()
  library(hopkins)   # hopkins()
  library(mclust)    # adjustedRandIndex()
})
# ---- Configuration ----
ROOT_DIR <- "C:/Users/aaa/OneDrive/Dokumenty/Courses/Unsupervised learning/Projects/data/TXT"
OCR_TRANSLATION_YEAR_MIN <- 2024 # conservative placeholder
OUTPUT_RDS <- "speeches_1946_2024.rds"
# ---- Helpers ----
set.seed(42)
# Find all .txt files recursively under a root directory
find_speech_files <- function(root) {
  stopifnot(is.character(root), length(root) == 1L)
  files <- list.files(
    path = root,
    pattern = "\\.txt$",
    full.names = TRUE,
    recursive = TRUE
  )
  if (length(files) == 0L) {
    stop(
      sprintf("No .txt files found under '%s'. Check the path or dataset layout.", root)
    )
  }
  files
}
# Parse filename into iso3, session, year (expected: ISO3_SESSION_YEAR.txt)
parse_filename_meta <- function(path) {
  fn <- basename(path)
  core <- str_remove(fn, "\\.txt$")
  parts <- str_split(core, "_", simplify = TRUE)
  if (ncol(parts) != 3L) {
    warning(sprintf("Skipping file with unexpected name pattern: %s", fn))
    return(tibble::tibble(
      file_path = path, iso3 = NA_character_, session = NA_integer_, year = NA_integer_
    ))
  }
  tibble::tibble(
    file_path = path,
    iso3 = parts[, 1],
    session = suppressWarnings(as.integer(parts[, 2])),
    year = suppressWarnings(as.integer(parts[, 3]))
  )
}
# Read a speech as a single UTF‑8 string and perform a basic whitespace squish
read_speech_text <- function(path) {
  text_raw <- readr::read_file(path)
  stringr::str_squish(text_raw)
}
# ---- Pipeline Steps ----
# 1) Build metadata table (one row per file)
build_meta <- function(file_paths) {
  purrr::map_dfr(file_paths, parse_filename_meta)
}
# 2) Attach text column (read each file)
attach_text <- function(meta_tbl) {
  meta_tbl %>%
    mutate(text = purrr::map_chr(file_path, read_speech_text))
}
# 3) Post-process flags and quick QC
finalize_df <- function(df, ocr_year_min = OCR_TRANSLATION_YEAR_MIN) {
  df %>%
    # Conservative indicator for possible OCR/translation in the latest session(s)
    mutate(suspected_ocr_or_translated = !is.na(year) & year >= ocr_year_min) %>%
    # Drop exact duplicate rows (rare, but safe)
    distinct() %>%
    # Flag empty texts
    mutate(is_empty = nchar(text, type = "chars", allowNA = FALSE) == 0L)
}
# 4) Quick summary (printed to console); objects are returned invisibly if needed
quick_diagnostics <- function(df) {
  message("---- Quick diagnostics ----")
  suppressMessages({
    print(df %>% count(year, name = "n") %>% arrange(year), n = 10)
    print(df %>% count(session, name = "n") %>% arrange(session), n = 10)
  })
  message("Summary(speeches_df):")
  print(summary(df))
  invisible(df)
}
# ---- Main ----
run_loader <- function(root_dir = ROOT_DIR, out_rds = OUTPUT_RDS) {
  message(sprintf("Scanning for .txt files under: %s", root_dir))
  speech_files <- find_speech_files(root_dir)
  message(sprintf("Found %d files. Parsing metadata ...", length(speech_files)))
  meta_df <- build_meta(speech_files)
  # (Optional) sanity checks on meta
  n_bad <- sum(is.na(meta_df$iso3) | is.na(meta_df$session) | is.na(meta_df$year))
  if (n_bad > 0L) {
    warning(sprintf("There are %d files with missing (iso3/session/year). See warnings above.", n_bad))
  }
  message("Reading text content (this may take a moment) ...")
  speeches_df <- attach_text(meta_df)
  message("Finalizing flags and quality checks ...")
  speeches_df <- finalize_df(speeches_df, ocr_year_min = OCR_TRANSLATION_YEAR_MIN)
  quick_diagnostics(speeches_df)
  message(sprintf("Saving RDS to: %s", out_rds))
  saveRDS(speeches_df, file = out_rds)
  message("Done.")
  speeches_df
}
# ---- Execute ----
# Run the loader; returns the tibble while also saving the RDS
speeches_df <- run_loader()
## Scanning for .txt files under: C:/Users/aaa/OneDrive/Dokumenty/Courses/Unsupervised learning/Projects/data/TXT
## Found 10952 files. Parsing metadata ...
## Reading text content (this may take a moment) ...
## Finalizing flags and quality checks ...
## ---- Quick diagnostics ----
## # A tibble: 79 × 2
## year n
## <int> <int>
## 1 1946 39
## 2 1947 39
## 3 1948 39
## 4 1949 35
## 5 1950 44
## 6 1951 51
## 7 1952 43
## 8 1953 44
## 9 1954 42
## 10 1955 45
## # ℹ 69 more rows
## # A tibble: 79 × 2
## session n
## <int> <int>
## 1 1 39
## 2 2 39
## 3 3 39
## 4 4 35
## 5 5 44
## 6 6 51
## 7 7 43
## 8 8 44
## 9 9 42
## 10 10 45
## # ℹ 69 more rows
## Summary(speeches_df):
## file_path iso3 session year
## Length:10952 Length:10952 Min. : 1.0 Min. :1946
## Class :character Class :character 1st Qu.:33.0 1st Qu.:1978
## Mode :character Mode :character Median :51.0 Median :1996
## Mean :48.3 Mean :1993
## 3rd Qu.:65.0 3rd Qu.:2010
## Max. :79.0 Max. :2024
## text suspected_ocr_or_translated is_empty
## Length:10952 Mode :logical Mode :logical
## Class :character FALSE:10760 FALSE:10952
## Mode :character TRUE :192
##
##
##
## Saving RDS to: speeches_1946_2024.rds
## Done.
We create a working text column text_clean from the raw text and apply light, reversible operations: Unicode normalization; removal of URLs and layout artifacts; lowercasing; punctuation and symbol stripping; and removal of standalone numerals (stemming is applied later, at tokenization). For 2024 materials and any suspected OCR/translation cases, we apply conservative whitespace normalization. A lightweight language detection step (optional) flags non‑English content for awareness without excluding documents; a minimal sketch is included at the end of the cleaning chunk below.
speeches_df <- readRDS("speeches_1946_2024.rds")
speeches_df <- speeches_df %>%
  mutate(text_clean = text) # start from raw, keep original intact
speeches_df <- speeches_df %>%
  mutate(
    text_clean = textclean::replace_non_ascii(text_clean), # normalize odd chars
    text_clean = str_replace_all(text_clean, "\\s+", " ")  # collapse whitespace
  )
speeches_df <- speeches_df %>%
  mutate(
    text_clean = str_remove_all(text_clean, "https?://\\S+"),
    text_clean = str_remove_all(text_clean, "www\\.[A-Za-z0-9\\-]+\\.[A-Za-z]{2,}\\S*"),
    text_clean = str_replace_all(text_clean, "—|–", "-"),
    text_clean = str_replace_all(text_clean, "·|•", " "),
    text_clean = str_squish(text_clean)
  )
fix_ocr <- function(x) {
  x %>%
    textclean::replace_white() # normalizes excess spaces
}
speeches_df <- speeches_df %>%
  mutate(
    text_clean = ifelse(suspected_ocr_or_translated, fix_ocr(text_clean), text_clean)
  )
speeches_df <- speeches_df %>%
  mutate(text_clean = str_to_lower(text_clean))
speeches_df <- speeches_df %>%
  mutate(text_clean = str_replace_all(text_clean, "[^\\p{L}\\p{N}\\s’'-]", " "))
# remove standalone numbers (for TF-IDF path)
speeches_df <- speeches_df %>%
  mutate(text_clean = str_replace_all(text_clean, "\\b\\d+\\b", " "))
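# (Optional) language-awareness step mentioned above: a minimal sketch using
# cld3 (loaded earlier). The column names lang_detected / is_non_english are
# illustrative choices; documents are only flagged, never dropped.
speeches_df <- speeches_df %>%
  mutate(
    lang_detected = cld3::detect_language(str_sub(text_clean, 1L, 2000L)),
    is_non_english = !is.na(lang_detected) & lang_detected != "en"
  )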
saveRDS(speeches_df, "speeches_preprocessed.rds")
We tokenize with quanteda, remove English stopwords, and apply stemming. A sparse document‑feature matrix (DFM) is trimmed with light frequency thresholds and transformed to TF‑IDF (log‑TF, inverse‑DF). Row vectors are L2‑normalized so that cosine similarity reduces to a dot product and document length no longer dominates distances.
speeches_df <- readRDS("speeches_preprocessed.rds")
# Build TF‑IDF (sparse DTM)
corp <- quanteda::corpus(speeches_df, text_field = "text_clean")
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE)
toks <- tokens_remove(toks, pattern = stopwords("en"))
toks <- tokens_wordstem(toks, language = "english")
dfm_mat <- dfm(toks)
dfm_mat <- dfm_trim(dfm_mat, min_termfreq = 5, min_docfreq = 3)
tfidf <- dfm_tfidf(dfm_mat, scheme_tf = "logcount", scheme_df = "inverse")
# L2-normalize document vectors so cosine ≈ dot product
row_norms <- sqrt(rowSums(tfidf^2))
tfidf_l2 <- Diagonal(x = 1 / pmax(row_norms, .Machine$double.eps)) %*% tfidf
saveRDS(tfidf_l2, "dfm_tfidf.rds")
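As a quick sanity check on the normalization (a sketch; the document pair i, j is arbitrary), the dot product of two rows of tfidf_l2 should equal the cosine similarity of the corresponding raw TF‑IDF vectors:
i <- 1; j <- 2 # arbitrary document pair
ti <- as.numeric(tfidf[i, ]); tj <- as.numeric(tfidf[j, ])
dot_ij <- sum(as.numeric(tfidf_l2[i, ]) * as.numeric(tfidf_l2[j, ]))
cos_ij <- sum(ti * tj) / (sqrt(sum(ti^2)) * sqrt(sum(tj^2)))
all.equal(dot_ij, cos_ij) # TRUE up to floating-point tolerance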
We assess cluster tendency before fitting any algorithm. A truncated SVD (via irlba) to 50 components stabilizes high‑dimensional distances, and the Hopkins statistic is computed on this low‑rank projection. Values near one support aggregation (clusterability), while values near one‑half suggest spatial randomness.
# Light projection to stabilize high-D distances (e.g., 50 components)
svd_50 <- irlba(tfidf_l2, nv = 50)
X50 <- svd_50$u %*% diag(svd_50$d)
H <- hopkins(X50, m = nrow(X50) %/% 10) # ≈10% sample
H
## [1] 1
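For intuition, a by‑hand sketch of the classic Hopkins statistic follows: u holds nearest‑neighbour distances from synthetic uniform points in the data's bounding box to the data, w holds nearest‑neighbour distances from sampled data points to the remaining data. (Implementations differ in whether distances are raised to the dimension d; the hopkins package does so, so its values can differ slightly from this sketch.)
hopkins_sketch <- function(X, m = 50) {
  n <- nrow(X); stopifnot(m < n)
  # u: NN distance from m uniform points in the bounding box to the data
  U <- apply(X, 2, function(col) runif(m, min(col), max(col)))
  u <- apply(U, 1, function(p) min(sqrt(colSums((t(X) - p)^2))))
  # w: NN distance from m sampled data points to the remaining points
  idx <- sample(n, m)
  w <- vapply(idx, function(i) {
    min(sqrt(colSums((t(X[-i, , drop = FALSE]) - X[i, ])^2)))
  }, numeric(1))
  sum(u) / (sum(u) + sum(w)) # near 1 => clustered; near 0.5 => random
}
# e.g., hopkins_sketch(X50, m = 100)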
We employ truncated SVD (LSA) on TF‑IDF because it is fast on sparse matrices, denoises lexical variation, and provides a linear subspace in which prototype‑based clustering behaves well. In text applications, LSA is a robust baseline that preserves interpretability while reducing dimensionality.
To avoid ad‑hoc settings, we evaluate a grid of candidate ranks (e.g., 50, 100, 150, 200) and select k via clustering stability. For each k, we project documents to the LSA space, L2‑normalize rows (spherical geometry), fit k‑means multiple times from independent initializations, and compute the mean Adjusted Rand Index (ARI) across all run pairs. The final k is the rank with the highest mean stability; when neighboring ranks are effectively tied, we prefer the smaller one (the “sweet spot” where adding components no longer buys stability).
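The ARI is the right yardstick here because k‑means labels are arbitrary up to permutation, so raw label agreement would be meaningless. A toy illustration with made‑up labelings:
a <- c(1, 1, 2, 2, 3, 3)
b <- c(2, 2, 3, 3, 1, 1) # same partition, clusters renamed
mclust::adjustedRandIndex(a, b)         # 1: ARI ignores label switching
mclust::adjustedRandIndex(a, sample(a)) # near 0 on average for unrelated labelings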
library(mclust) # adjustedRandIndex
# Choose ranks to test; you can expand this grid.
k_grid <- c(50, 100, 150, 200)
# Helper: row L2-normalization (spherical / cosine geometry)
normalize_rows <- function(M) {
  rs <- sqrt(rowSums(M^2))
  rs[rs == 0] <- 1
  M / rs
}
# Helper: spherical k-means stability across runs (fixed K)
km_stability <- function(Z, K = 8, runs = 10, nstart_each = 10, itermax = 300, algorithm = "Lloyd") {
  labs <- replicate(runs, {
    km <- kmeans(Z, centers = K, nstart = nstart_each, iter.max = itermax, algorithm = algorithm)
    km$cluster
  })
  comb <- combn(ncol(labs), 2)
  mean(apply(comb, 2, function(p) mclust::adjustedRandIndex(labs[, p[1]], labs[, p[2]])))
}
# Helper: choose K by stability (grid)
pick_K_by_stability <- function(Zs, K_grid = 6:12) {
  stab <- sapply(K_grid, function(K) km_stability(Zs, K = K, runs = 10, nstart_each = 10, itermax = 300))
  tibble(K = K_grid, mean_ARI = stab) |> arrange(desc(mean_ARI))
}
# Helper: cosine-silhouette (sanity check, not primary for text)
cosine_silhouette <- function(Zs, clustering) {
  d_cos <- as.dist(proxy::dist(Zs, method = "cosine"))
  sil <- cluster::silhouette(clustering, d_cos)
  mean(sil[, 3])
}
svd_list <- lapply(k_grid, function(k) {
  message(sprintf("Computing truncated SVD (k=%d) ...", k))
  set.seed(42)
  irlba::irlba(tfidf_l2, nv = k)
})
## Computing truncated SVD (k=50) ...
## Computing truncated SVD (k=100) ...
## Computing truncated SVD (k=150) ...
## Computing truncated SVD (k=200) ...
names(svd_list) <- paste0("k", k_grid)
# 4) Stability over k (spherical space) — pick the sweet spot
stab_over_k <- purrr::map2_dbl(svd_list, k_grid, function(svd_k, k) {
  Z <- svd_k$u %*% diag(svd_k$d)
  Zs <- normalize_rows(Z)
  km_stability(Zs, K = 8, runs = 10, nstart_each = 10, itermax = 300)
})
stab_tbl <- tibble(k = k_grid, mean_ARI = stab_over_k) |> arrange(desc(mean_ARI))
print(stab_tbl)
## # A tibble: 4 × 2
## k mean_ARI
## <dbl> <dbl>
## 1 100 0.908
## 2 150 0.847
## 3 50 0.788
## 4 200 0.773
k_star <- stab_tbl$k[1]
svd_k <- svd_list[[paste0("k", k_star)]]
Z <- svd_k$u %*% diag(svd_k$d)
Zs <- normalize_rows(Z)
message(sprintf("Selected DR rank k = %d (highest mean ARI)", k_star))
## Selected DR rank k = 100 (highest mean ARI)
Clustering is performed with k‑means in the LSA subspace under a spherical (cosine) geometry. Row‑normalizing document vectors reduces the influence of length and approximates spherical k‑means, which is better aligned with cosine similarity for high‑dimensional text.
We evaluate K on a small grid (e.g., 6–12) using the same stability criterion (mean ARI across seeds). As a sanity check, we report a cosine‑based silhouette. Because Euclidean silhouettes can be misleadingly small in sparse, high‑dimensional settings, they are not the primary selection metric.
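The two geometries are linked exactly for unit vectors: ‖u − v‖² = 2(1 − cos(u, v)), so Euclidean k‑means on L2‑normalized rows ranks pairs exactly as cosine distance does. A quick numerical check:
u <- c(1, 2, 3); u <- u / sqrt(sum(u^2))
v <- c(3, 1, 0); v <- v / sqrt(sum(v^2))
all.equal(sum((u - v)^2), 2 * (1 - sum(u * v))) # TRUE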
K_candidates <- 6:12
K_ranked <- pick_K_by_stability(Zs, K_grid = K_candidates)
print(K_ranked)
## # A tibble: 7 × 2
## K mean_ARI
## <int> <dbl>
## 1 6 0.979
## 2 7 0.837
## 3 8 0.828
## 4 9 0.765
## 5 10 0.681
## 6 11 0.624
## 7 12 0.608
K_star <- K_ranked$K[1]
km_fit <- kmeans(Zs, centers = K_star, nstart = 20, iter.max = 300) # final fit (stability runs above used algorithm = "Lloyd")
sil_cos <- cosine_silhouette(Zs, km_fit$cluster)
message(sprintf("Chosen K = %d | cosine silhouette ≈ %.3f", K_star, sil_cos))
## Chosen K = 6 | cosine silhouette ≈ 0.099
To obtain human‑readable themes, we use a class‑based TF‑IDF (c‑TF‑IDF) procedure: documents are concatenated within each cluster, and a TF‑IDF model across clusters identifies terms that are discriminative for that cluster. We report top terms as labels, which facilitates interpretation without training a probabilistic topic model.
speeches_df$cluster_km <- km_fit$cluster
cluster_sizes <- as.data.frame(table(km_fit$cluster)) |>
  dplyr::rename(cluster_km = Var1, n_docs = Freq) |>
  dplyr::arrange(desc(n_docs))
print(cluster_sizes)
## cluster_km n_docs
## 1 2 2332
## 2 3 2112
## 3 4 2103
## 4 1 1613
## 5 6 1503
## 6 5 1289
# Save core artifacts
saveRDS(list(
  k_star = k_star,
  K_star = K_star,
  svd = svd_k,
  scores = Z,            # LSA scores
  scores_spherical = Zs, # normalized
  kmeans = km_fit,
  labels = speeches_df$cluster_km,
  cluster_sizes = cluster_sizes
), file = sprintf("clustering_artifacts_k%d_K%d.rds", k_star, K_star))
# Collapse all documents in a cluster into one "class document",
# then compute TF‑IDF across those classes to get top terms per cluster.
message("Computing c‑TF‑IDF labels ...")
## Computing c‑TF‑IDF labels ...
class_docs <- speeches_df |>
  dplyr::mutate(cluster_km = paste0("C", cluster_km)) |>
  dplyr::group_by(cluster_km) |>
  dplyr::summarise(text = paste(text_clean, collapse = " "), .groups = "drop")
corp_c <- quanteda::corpus(class_docs, text_field = "text", docid_field = "cluster_km") # keep cluster IDs as docnames
toks_c <- tokens(corp_c, remove_punct = TRUE, remove_symbols = TRUE)
toks_c <- tokens_remove(toks_c, stopwords("en"))
toks_c <- tokens_wordstem(toks_c, language = "english")
dfm_c <- dfm(toks_c)
# Classical TF‑IDF on class documents approximates c‑TF‑IDF weighting.
tfidf_c <- dfm_tfidf(dfm_c, scheme_tf = "count", scheme_df = "inverse")
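# For reference, one common c-TF-IDF formulation (Grootendorst's BERTopic)
# weights term t in class c as tf_{t,c} * log(1 + A / f_t), where A is the
# average class length and f_t is the term's total frequency across classes.
# A minimal sketch under those assumptions (illustrative; not used below):
ctfidf_manual <- function(dfm_counts) {
  counts <- as.matrix(dfm_counts)
  tf <- counts / rowSums(counts) # term share within each class
  A <- mean(rowSums(counts))     # average words per class
  f_t <- colSums(counts)         # term frequency across all classes
  sweep(tf, 2, log(1 + A / f_t), `*`)
}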
top_terms_per_cluster <- function(tfidf_mat, top_n = 12) {
  terms <- colnames(tfidf_mat)
  res <- lapply(seq_len(nrow(tfidf_mat)), function(i) {
    s <- as.numeric(tfidf_mat[i, ])
    ord <- order(s, decreasing = TRUE)[1:min(top_n, sum(s > 0))]
    tibble(
      cluster_km = rownames(tfidf_mat)[i],
      rank = seq_along(ord),
      term = terms[ord],
      weight = s[ord]
    )
  })
  dplyr::bind_rows(res)
}
c_labels <- top_terms_per_cluster(tfidf_c, top_n = 12)
readr::write_csv(c_labels, sprintf("cluster_labels_cTFIDF_k%d_K%d.csv", k_star, K_star))
library(knitr)
kable(top_terms_per_cluster(tfidf_c, top_n = 3))
| cluster_km | rank | term | weight |
|---|---|---|---|
| C1 | 1 | bosnia | 258.32588 |
| C1 | 2 | herzegovina | 200.92013 |
| C1 | 3 | croatia | 93.32837 |
| C2 | 1 | kampuchea | 570.71177 |
| C2 | 2 | kampuchean | 435.61171 |
| C2 | 3 | connexion | 213.12924 |
| C3 | 1 | mdgs | 368.76174 |
| C3 | 2 | hiv | 194.93302 |
| C3 | 3 | nepad | 142.65926 |
| C4 | 1 | viet-nam | 1937.11229 |
| C4 | 2 | connexion | 647.81655 |
| C4 | 3 | viet-names | 408.41579 |
| C5 | 1 | covid | 651.72994 |
| C5 | 2 | sdgs | 369.66483 |
| C5 | 3 | pandem | 231.60514 |
| C6 | 1 | azerbaijan | 113.05059 |
| C6 | 2 | bosnia | 69.55605 |
| C6 | 3 | serbia | 63.74504 |
The c‑TF‑IDF labels indicate coherent thematic groupings. Illustrative examples from the table above include a Balkan‑wars cluster (bosnia, herzegovina, croatia), Cold War–era Indochina clusters (kampuchea, viet‑nam, with the archaic spelling connexion marking older transcripts), a development/MDG cluster (mdgs, hiv, nepad), and a pandemic‑and‑sustainability cluster (covid, sdgs, pandem).
These labels are descriptive rather than prescriptive; they help anchor interpretation while preserving an unsupervised pipeline.
Plotting the number of documents per cluster by year reveals distinct eras where a single thematic cluster dominates, as well as hand‑offs between clusters. The timing of inflection points aligns qualitatively with widely recognized transitions in international politics (e.g., post‑colonial expansions, oil shocks and North–South political economy debates, end of the Cold War, the MDG/SDG agenda, COVID‑19, and renewed security emphasis in the early 2020s). Vertical reference lines can be added to the figure to annotate these periods.
library(ggplot2)
speeches_df_plot <- speeches_df |>
  mutate(
    year = suppressWarnings(as.integer(year)),
    iso3 = toupper(as.character(iso3)),
    cluster_km = factor(cluster_km)
  )
per_cluster_year_counts <- speeches_df_plot |>
  filter(!is.na(year)) |>
  count(cluster_km, year, name = "n_docs") |>
  arrange(cluster_km, year)
cluster_year_summary <- per_cluster_year_counts |>
  group_by(cluster_km) |>
  summarise(
    n_docs_total = sum(n_docs),
    first_year = min(year, na.rm = TRUE),
    last_year = max(year, na.rm = TRUE),
    timespan_years = last_year - first_year + 1L,
    n_years_covered = n_distinct(year),
    .groups = "drop"
  ) |>
  arrange(desc(n_docs_total))
ggplot(per_cluster_year_counts, aes(year, n_docs, color = cluster_km)) +
  geom_line(linewidth = 0.6, alpha = 0.9) +
  geom_point(size = 0.9, alpha = 0.8) +
  labs(
    title = sprintf("UN Speeches — Documents per Year by Cluster (k=%d, K=%d)", k_star, K_star),
    x = "Year", y = "Documents",
    color = "Cluster"
  ) +
  theme_minimal(base_size = 12)
ggsave(sprintf("plot_docs_per_year_by_cluster_k%d_K%d.png", k_star, K_star),
       width = 9, height = 5.5, dpi = 150)
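As suggested above, vertical reference lines can annotate recognized transitions. A sketch follows; the years are illustrative choices on our part (decolonization wave, oil shock, end of the Cold War, 9/11, SDG adoption, COVID‑19), not model outputs:
era_years <- c(1960, 1973, 1991, 2001, 2015, 2020)
ggplot(per_cluster_year_counts, aes(year, n_docs, color = cluster_km)) +
  geom_line(linewidth = 0.6, alpha = 0.9) +
  geom_vline(xintercept = era_years, linetype = "dashed", color = "grey40") +
  labs(
    title = "Documents per Year by Cluster, with Illustrative Era Markers",
    x = "Year", y = "Documents", color = "Cluster"
  ) +
  theme_minimal(base_size = 12)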
Per‑cluster country summaries (top ISO‑3 codes and counts) show that some themes are geographically concentrated (e.g., regional conflicts, coalition statements), while others are broadly shared (e.g., development and sustainability). These distributions help distinguish clusters driven by regional crises from those reflecting global agenda items.
topN_iso <- 5
per_cluster_iso_counts <- speeches_df_plot |>
  filter(!is.na(iso3) & nzchar(iso3)) |>
  count(cluster_km, iso3, name = "n_docs") |>
  arrange(cluster_km, desc(n_docs), iso3)
cluster_iso_summary <- per_cluster_iso_counts |>
  group_by(cluster_km) |>
  summarise(
    n_docs_total = sum(n_docs),
    n_iso3 = n_distinct(iso3),
    .groups = "drop"
  ) |>
  arrange(desc(n_docs_total))
# (E2) Top ISO3 per cluster (faceted bar chart; shows the top topN_iso codes per cluster)
top_iso_plot_df <- per_cluster_iso_counts |>
  group_by(cluster_km) |>
  slice_max(order_by = n_docs, n = topN_iso, with_ties = TRUE) |>
  ungroup() |>
  mutate(iso3 = forcats::fct_reorder(iso3, n_docs))
ggplot(top_iso_plot_df, aes(x = n_docs, y = iso3, fill = cluster_km)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ cluster_km, scales = "free_y") +
  labs(
    title = sprintf("UN Speeches — Top %d ISO3 per Cluster (k=%d, K=%d)", topN_iso, k_star, K_star),
    x = "Documents", y = "ISO3"
  ) +
  theme_minimal(base_size = 12)
ggsave(sprintf("plot_top_iso_by_cluster_k%d_K%d.png", k_star, K_star),
       width = 9, height = 7, dpi = 150)
The analysis shows that a transparent, linear pipeline—TF‑IDF → LSA → spherical k‑means—can recover meaningful structure in a long‑horizon political speech corpus. The resulting clusters correspond intuitively to known periods of UN discourse, and their temporal trajectories underline how the agenda shifts from security and sovereignty concerns to development and sustainability, and then to health and renewed security emphasis. Because labeling is derived by c‑TF‑IDF rather than exogenous dictionaries, interpretability follows from the data. Two aspects are noteworthy. First, the presence of a diffuse cluster reflects the reality that some topics pervade multiple eras and regions (e.g., terrorism, sanctions, procedural reform). Second, the stability‑based choice of rank and K guards against overfitting and makes explicit where the clustering is robust and where it becomes sensitive to modeling choices.