quanteda.textstats is a companion package to quanteda that provides statistical analysis functions for text objects (corpora, tokens, and document-feature matrices). This document walks through each major function with reproducible examples using the built-in data_corpus_inaugural dataset.
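# Install packages if needed (tidyr, tibble, knitr and ggrepel are called
# via :: further down, so they must be installed as well):
# install.packages(c("quanteda", "quanteda.textstats",
# "quanteda.textplots", "ggplot2", "dplyr",
# "tidyr", "tibble", "knitr", "ggrepel"))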
library(quanteda)
library(quanteda.textstats)
library(quanteda.textplots)
library(ggplot2)
library(dplyr)
# ── Build a DFM from US inaugural speeches ──────────────────────────────────
corp <- data_corpus_inaugural # US inaugural addresses (60 in the version used here)
toks <- corp |>
tokens(remove_punct = TRUE,
remove_symbols = TRUE,
remove_numbers = TRUE) |>
tokens_remove(pattern = stopwords("en")) |>
tokens_wordstem()
dfm_all <- dfm(toks)
# Subset: speeches since 1965 (16 documents) for compact comparisons
corp15 <- corpus_subset(corp, Year >= 1965)
toks15 <- tokens(corp15,
remove_punct = TRUE,
remove_symbols = TRUE,
remove_numbers = TRUE) |>
tokens_remove(stopwords("en")) |>
tokens_wordstem()
dfm15 <- dfm(toks15)
cat("Full DFM:", ndoc(dfm_all), "docs ×", nfeat(dfm_all), "features\n")
## Full DFM: 60 docs × 5540 features
cat("Subset DFM:", ndoc(dfm15), "docs ×", nfeat(dfm15), "features\n")
## Subset DFM: 16 docs × 2822 features

textstat_frequency(): Term Frequency
textstat_frequency() returns term frequencies across the whole corpus or within groups, together with each term's rank and document frequency, making it easy to find the most common vocabulary.
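To make the frequency, docfreq and rank columns concrete, here is a minimal base R sketch on a toy count matrix. The matrix values and the name toy_freq are invented for illustration, and this is not the package's implementation.

```r
# Toy document-feature matrix: rows are documents, columns are features.
# The counts are made up for illustration only.
m <- matrix(c(2, 0, 1,
              1, 3, 0,
              0, 1, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("d1", "d2", "d3"),
                            c("nation", "peopl", "govern")))

frequency <- colSums(m)      # total occurrences of each feature
docfreq   <- colSums(m > 0)  # number of documents containing each feature

toy_freq <- data.frame(feature   = names(frequency),
                       frequency = unname(frequency),
                       docfreq   = unname(docfreq))
toy_freq <- toy_freq[order(-toy_freq$frequency), ]
# Tied frequencies share the lowest rank, as "one" and "right" do in the table
toy_freq$rank <- rank(-toy_freq$frequency, ties.method = "min")
print(toy_freq)
```

Document frequency is often the more informative column: a term with a high total count but a low docfreq is concentrated in a handful of speeches.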
# Top 20 terms overall
freq <- textstat_frequency(dfm_all, n = 20)
knitr::kable(freq[, c("feature","frequency","rank","docfreq")],
caption = "Top 20 Stems: All Inaugural Speeches",
align = "lrrr") # one alignment character per column
| feature | frequency | rank | docfreq |
|---|---|---|---|
| nation | 713 | 1 | 59 |
| govern | 666 | 2 | 55 |
| peopl | 640 | 3 | 58 |
| us | 507 | 4 | 57 |
| can | 489 | 5 | 57 |
| state | 463 | 6 | 49 |
| great | 389 | 7 | 57 |
| power | 384 | 8 | 54 |
| must | 377 | 9 | 53 |
| countri | 376 | 10 | 57 |
| upon | 371 | 11 | 47 |
| world | 357 | 12 | 55 |
| may | 343 | 13 | 54 |
| shall | 316 | 14 | 51 |
| everi | 309 | 15 | 53 |
| constitut | 291 | 16 | 42 |
| peac | 288 | 17 | 51 |
| one | 286 | 18 | 53 |
| right | 286 | 18 | 56 |
| american | 277 | 20 | 49 |
freq20 <- textstat_frequency(dfm_all, n = 20)
ggplot(freq20, aes(x = reorder(feature, frequency), y = frequency)) +
geom_col(fill = "#2c7bb6", alpha = .85) +
coord_flip() +
labs(title = "Top 20 Terms — US Inaugural Addresses",
subtitle = "After stopword removal and stemming",
x = NULL, y = "Total Frequency") +
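theme_minimal(base_size = 13)
# Frequency within groups (before 2000 vs 2000 onwards)
# Note: the "20th" label covers every speech before 2000, including the
# 18th- and 19th-century addresses in dfm_all
docvars(dfm_all, "century") <- ifelse(docvars(dfm_all, "Year") >= 2000,
"21st", "20th")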
freq_grp <- textstat_frequency(dfm_all, n = 10, groups = century)
ggplot(freq_grp, aes(x = reorder(feature, frequency), y = frequency,
fill = group)) +
geom_col(show.legend = FALSE) +
facet_wrap(~group, scales = "free_y") +
coord_flip() +
scale_fill_manual(values = c("20th" = "#d7191c", "21st" = "#1a9641")) +
labs(title = "Top Terms by Century", x = NULL, y = "Frequency") +
theme_minimal(base_size = 12)

textstat_lexdiv(): Lexical Diversity
Measures how rich and varied the vocabulary of each document is. Several indices are available, such as TTR, MATTR, and MSTTR.
TTR (Type-Token Ratio)
Shows the proportion of unique words (types) relative to all words (tokens) in a text. Drawback: it is very sensitive to text length, so it cannot be used directly to compare texts of different lengths.
MATTR (Moving-Average Type-Token Ratio)
A more recent measure that is less sensitive to text length: it averages the TTR over a window that slides across the text. Advantage: it is suitable for comparing the vocabulary of texts of different lengths without the length distorting the result.
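The difference between the two measures can be sketched in a few lines of base R. The helper names ttr() and mattr(), the toy token vector, and the window of 5 are all assumptions chosen for illustration; textstat_lexdiv() has its own implementation and window setting.

```r
# Type-token ratio: unique tokens divided by all tokens
ttr <- function(x) length(unique(x)) / length(x)

# Moving-average TTR: mean TTR over every window of `window` consecutive tokens
mattr <- function(x, window = 5) {
  stopifnot(length(x) >= window)
  starts <- seq_len(length(x) - window + 1)
  mean(vapply(starts,
              function(i) ttr(x[i:(i + window - 1)]),
              numeric(1)))
}

# Hypothetical token sequence (9 tokens, 6 distinct types)
toy_tokens <- c("we", "the", "people", "of", "the",
                "nation", "thank", "the", "people")

toy_ttr   <- ttr(toy_tokens)
toy_mattr <- mattr(toy_tokens, window = 5)
print(c(TTR = toy_ttr, MATTR = toy_mattr))
```

Because every window has the same size, appending more text changes the number of windows but not the scale of each one, which is why MATTR is less length-dependent than plain TTR.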
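# MTLD is requested here, but the rendered table below contains only TTR and
# MATTR, so the installed version appears not to return it; the intersect()
# calls further down keep the code working either way
ld <- textstat_lexdiv(toks15,
measure = c("TTR", "MATTR", "MTLD"))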
# Attach metadata and drop the 'document' column for display
ld$President <- docvars(corp15, "President")
ld$Year <- docvars(corp15, "Year")
# Select only columns that actually exist
show_cols <- intersect(c("Year", "President", "TTR", "MATTR", "MTLD"), names(ld))
knitr::kable(ld[, show_cols],
digits = 3,
caption = "Lexical Diversity: Speeches Since 1965")
| Year | President | TTR | MATTR |
|---|---|---|---|
| 1965 | Johnson | 0.573 | 0.828 |
| 1969 | Nixon | 0.529 | 0.805 |
| 1973 | Nixon | 0.425 | 0.701 |
| 1977 | Carter | 0.647 | 0.847 |
| 1981 | Reagan | 0.567 | 0.839 |
| 1985 | Reagan | 0.512 | 0.828 |
| 1989 | Bush | 0.516 | 0.820 |
| 1993 | Clinton | 0.584 | 0.832 |
| 1997 | Clinton | 0.510 | 0.812 |
| 2001 | Bush | 0.576 | 0.855 |
| 2005 | Bush | 0.529 | 0.834 |
| 2009 | Obama | 0.606 | 0.894 |
| 2013 | Obama | 0.596 | 0.880 |
| 2017 | Trump | 0.583 | 0.807 |
| 2021 | Biden | 0.501 | 0.827 |
| 2025 | Trump | 0.520 | 0.857 |
# Pivot only the measure columns that exist
measure_cols <- intersect(c("TTR", "MATTR", "MTLD"), names(ld))
ld_long <- tidyr::pivot_longer(ld,
cols = all_of(measure_cols),
names_to = "Measure",
values_to = "Score")
ggplot(ld_long, aes(x = factor(Year), y = Score,
group = Measure, colour = Measure)) +
geom_line(linewidth = 1) +
geom_point(size = 2.5) +
facet_wrap(~Measure, scales = "free_y", ncol = 1) +
labs(title = "Lexical Diversity Over Time",
x = "Year", y = "Score") +
theme_minimal(base_size = 12) +
theme(legend.position = "none",
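axis.text.x = element_text(angle = 45, hjust = 1))

textstat_readability(): Readability Scores
Quantifies how easy or difficult each speech is to read using classic formulas such as Flesch Reading Ease, Flesch-Kincaid, Gunning Fog (FOG), and SMOG. These formulas rely on sentence length and syllable counts, so the function needs the original text rather than a tokens object or DFM.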
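As a worked example of what such a score is built from, the sketch below applies the standard Flesch Reading Ease and Flesch-Kincaid grade formulas to hypothetical word, sentence and syllable counts. The function names and the counts are assumptions for illustration; textstat_readability() derives the counts from the text itself.

```r
# Flesch Reading Ease: higher values mean easier text
flesch_ease <- function(n_words, n_sentences, n_syllables) {
  206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)
}

# Flesch-Kincaid grade level: higher values mean harder text
flesch_kincaid_grade <- function(n_words, n_sentences, n_syllables) {
  0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59
}

# Hypothetical passage: 100 words, 5 sentences, 150 syllables
ease  <- flesch_ease(100, 5, 150)
grade <- flesch_kincaid_grade(100, 5, 150)
print(c(Flesch = ease, Flesch.Kincaid = grade))
```

The two scores move in opposite directions: longer sentences and longer words lower the Reading Ease score and raise the grade level, which matches the pattern in the table (for example 2021 Biden has the highest Flesch and the lowest Flesch.Kincaid).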
# Works on a corpus (needs sentence structure)
corp15_nostop <- corpus_subset(data_corpus_inaugural, Year >= 1965)
rd <- textstat_readability(corp15_nostop,
measure = c("Flesch",
"Flesch.Kincaid",
"FOG",
"SMOG"))
rd$President <- docvars(corp15_nostop, "President")
rd$Year <- docvars(corp15_nostop, "Year")
knitr::kable(rd[, c("Year","President","Flesch","Flesch.Kincaid","FOG","SMOG")],
digits = 2,
caption = "Readability Scores: Speeches Since 1965")
| Year | President | Flesch | Flesch.Kincaid | FOG | SMOG |
|---|---|---|---|---|---|
| 1965 | Johnson | 69.41 | 7.56 | 10.41 | 10.36 |
| 1969 | Nixon | 65.58 | 9.24 | 12.05 | 11.13 |
| 1973 | Nixon | 54.19 | 12.30 | 15.20 | 13.10 |
| 1977 | Carter | 53.38 | 11.67 | 14.55 | 13.06 |
| 1981 | Reagan | 58.75 | 9.76 | 12.92 | 12.22 |
| 1985 | Reagan | 57.58 | 10.42 | 13.48 | 12.47 |
| 1989 | Bush | 73.10 | 7.15 | 9.98 | 9.88 |
| 1993 | Clinton | 55.81 | 10.38 | 13.20 | 12.37 |
| 1997 | Clinton | 59.22 | 9.83 | 12.69 | 11.96 |
| 2001 | Bush | 60.12 | 8.93 | 11.63 | 11.37 |
| 2005 | Bush | 53.19 | 11.04 | 14.11 | 13.02 |
| 2009 | Obama | 60.53 | 10.23 | 12.71 | 11.55 |
| 2013 | Obama | 53.56 | 11.73 | 14.51 | 12.95 |
| 2017 | Trump | 58.58 | 9.17 | 12.16 | 11.78 |
| 2021 | Biden | 73.20 | 5.78 | 8.74 | 9.37 |
| 2025 | Trump | 55.08 | 9.67 | 12.64 | 12.15 |
ggplot(rd, aes(x = Year, y = Flesch, label = President)) +
geom_smooth(method = "loess", se = TRUE,
colour = "#f46d43", fill = "#fee090", linewidth = 1) +
geom_point(colour = "#4393c3", size = 3) +
ggrepel::geom_text_repel(size = 3, max.overlaps = 6) +
labs(title = "Flesch Reading Ease Over Time",
subtitle = "Higher = easier to read",
x = "Year", y = "Flesch Score") +
theme_minimal(base_size = 13)

textstat_dist(): Document Distance
Computes pairwise distances between documents (or features) in the DFM. Supported methods include Euclidean, Manhattan, maximum, Canberra, and Minkowski; cosine is a similarity measure and belongs to textstat_simil().
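For intuition about what a pairwise distance matrix contains, here is a small sketch using base R's dist() on a made-up count matrix. The toy_dfm values are invented, and dist() is used as a stand-in here, not as the package's implementation.

```r
# Toy counts: d3 uses the same words as d1 but is twice as long
toy_dfm <- rbind(d1 = c(2, 0, 1),
                 d2 = c(1, 3, 0),
                 d3 = c(4, 0, 2))
colnames(toy_dfm) <- c("nation", "peopl", "govern")

euclid <- as.matrix(dist(toy_dfm, method = "euclidean"))
manhat <- as.matrix(dist(toy_dfm, method = "manhattan"))

print(round(euclid, 3))
print(manhat)
```

Note that d1 and d3 have identical word proportions yet a non-zero distance. Distances on raw counts are sensitive to document length, so weighting the DFM by proportions first (for example with dfm_weight(scheme = "prop")) may be worth considering when speeches differ a lot in length.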
dist_mat <- textstat_dist(dfm15, method = "euclidean")
# Build unique labels: "Year President" to avoid duplicate name clashes
labels15 <- paste(docvars(corp15, "Year"), docvars(corp15, "President"))
dist_df <- as.matrix(dist_mat)
rownames(dist_df) <- labels15
colnames(dist_df) <- labels15
# Heatmap via ggplot
dist_long <- dist_df |>
as.data.frame() |>
tibble::rownames_to_column("Doc1") |>
tidyr::pivot_longer(-Doc1, names_to = "Doc2", values_to = "Distance")
ggplot(dist_long, aes(x = Doc1, y = Doc2, fill = Distance)) +
geom_tile() +
scale_fill_distiller(palette = "RdYlBu", direction = -1) +
labs(title = "Euclidean Distance Between Speeches",
x = NULL, y = NULL) +
theme_minimal(base_size = 10) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

textstat_simil(): Document Similarity
The counterpart to textstat_dist(): higher values mean more similar documents. Cosine similarity is the most popular choice for text because it compares the direction of the count vectors and ignores their length.
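The length-independence of cosine similarity is easy to check by hand. The function cosine_sim() and the three toy vectors below are assumptions for illustration only.

```r
# Cosine similarity: dot product divided by the product of the vector norms
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

doc_a <- c(2, 0, 1)  # hypothetical term counts
doc_b <- c(4, 0, 2)  # same proportions as doc_a, twice as long
doc_c <- c(0, 3, 0)  # shares no terms with doc_a

sim_ab <- cosine_sim(doc_a, doc_b)
sim_ac <- cosine_sim(doc_a, doc_c)
print(c(a_vs_b = sim_ab, a_vs_c = sim_ac))
```

Documents with the same word proportions score 1 regardless of length, and documents with no shared terms score 0.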
sim_mat <- textstat_simil(dfm15, method = "cosine")
# Same unique labels as above
sim_df <- as.matrix(sim_mat)
rownames(sim_df) <- labels15
colnames(sim_df) <- labels15
sim_long <- sim_df |>
as.data.frame() |>
tibble::rownames_to_column("Doc1") |>
tidyr::pivot_longer(-Doc1, names_to = "Doc2", values_to = "Similarity")
ggplot(sim_long, aes(x = Doc1, y = Doc2, fill = Similarity)) +
geom_tile() +
scale_fill_distiller(palette = "YlGn", direction = 1) +
labs(title = "Cosine Similarity Between Speeches",
x = NULL, y = NULL) +
theme_minimal(base_size = 10) +
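theme(axis.text.x = element_text(angle = 45, hjust = 1))

textstat_keyness(): Keyness Analysis
Identifies terms that are significantly more (or less) frequent in a target set than in a reference set. The default measure is chi-squared; a likelihood-ratio statistic, Fisher's exact test, and pointwise mutual information are also available through the measure argument. Positive scores mark terms over-represented in the target, negative scores terms over-represented in the reference.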
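The idea behind a chi-squared keyness score can be sketched with a single 2×2 table and base R's chisq.test(). The counts are invented, the sign convention is written out by hand, and the continuity correction used here (Yates, via correct = TRUE) is an assumption; the package's own correction settings may differ.

```r
# Hypothetical counts for one term: rows are corpora, columns are
# occurrences of the term vs all other tokens
counts <- matrix(c(30,  970,
                   20, 4980),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(group = c("target", "reference"),
                                 term  = c("term", "other")))

test <- chisq.test(counts, correct = TRUE)

# Sign the statistic: positive if the term is more frequent in the target
# than expected under independence, negative otherwise
expected_target <- test$expected["target", "term"]
signed_chi2 <- unname(test$statistic) *
  ifelse(counts["target", "term"] > expected_target, 1, -1)

print(c(expected = expected_target, chi2 = signed_chi2, p = test$p.value))
```

Here the term occurs 30 times in the target where about 8 would be expected, so the signed score is positive and the term would count as a keyword of the target set.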
# Target: speeches from 2000 onwards ("21st"); Reference: all earlier speeches (labelled "20th")
dfm_cent <- dfm_group(dfm_all, groups = century)
key <- textstat_keyness(dfm_cent, target = "21st")
head(key, 15) |>
knitr::kable(digits = 3,
caption = "Keyness: 21st-Century vs 20th-Century Speeches")
| feature | chi2 | p | n_target | n_reference |
|---|---|---|---|---|
| america | 245.853 | 0 | 107 | 153 |
| thank | 163.363 | 0 | 34 | 17 |
| american | 120.372 | 0 | 87 | 190 |
| stori | 108.238 | 0 | 19 | 5 |
| job | 73.102 | 0 | 15 | 6 |
| today | 65.895 | 0 | 41 | 80 |
| day | 45.848 | 0 | 37 | 87 |
| border | 45.278 | 0 | 11 | 6 |
| soul | 38.496 | 0 | 11 | 8 |
| generat | 37.875 | 0 | 25 | 51 |
| back | 37.771 | 0 | 18 | 26 |
| storm | 37.154 | 0 | 8 | 3 |
| ideal | 36.481 | 0 | 19 | 32 |
| worker | 34.837 | 0 | 7 | 2 |
| freedom | 33.312 | 0 | 46 | 147 |
# quanteda.textplots provides a dedicated keyness plot
if (requireNamespace("quanteda.textplots", quietly = TRUE)) {
quanteda.textplots::textplot_keyness(key,
n = 15,
labelcolor = "grey30") +
labs(title = "Keyness Plot — 21st vs 20th Century") +
theme_minimal(base_size = 12)
} else {
# Fallback ggplot version
key_top <- rbind(head(key, 10), tail(key, 10))
key_top$direction <- ifelse(key_top$chi2 > 0, "21st century", "20th century")
ggplot(key_top, aes(x = reorder(feature, chi2), y = chi2, fill = direction)) +
geom_col() +
coord_flip() +
scale_fill_manual(values = c("21st century" = "#1a9641",
"20th century" = "#d7191c")) +
labs(title = "Keyness: Top Distinctive Terms",
x = NULL, y = "Chi-squared statistic", fill = NULL) +
theme_minimal(base_size = 12)
}

textstat_collocations(): Collocations
Finds multi-word expressions whose words appear together more often than expected by chance. Useful for discovering idioms, named entities, and technical phrases.
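The starting point for any collocation score is a count of adjacent word pairs. The base R sketch below reproduces only that counting step on an invented token vector; it does not compute the lambda or z statistics that textstat_collocations() reports.

```r
# Hypothetical token sequence
toy_toks <- c("fellow", "citizens", "of", "the", "united", "states",
              "the", "united", "states", "of", "america")

# Pair every token with its right-hand neighbour
bigrams <- paste(head(toy_toks, -1), tail(toy_toks, -1))

# Raw bigram counts, most frequent first (the `count` column in the tables)
bigram_counts <- sort(table(bigrams), decreasing = TRUE)
print(bigram_counts)
```

Raw counts alone favour pairs of very common words such as "of the", which is why an association statistic is needed to surface phrases like "united states" or "fellow citizens".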
# Run on the tokens object (before stopword removal for natural phrases)
toks_raw <- tokens(corp, remove_punct = TRUE)
col <- textstat_collocations(toks_raw,
size = 2, # bigrams
min_count = 5) # at least 5 occurrences
head(col, 20) |>
knitr::kable(digits = 3,
caption = "Top Bigram Collocations (λ statistic)")
| collocation | count | count_nested | length | lambda | z |
|---|---|---|---|---|---|
| of the | 1786 | 0 | 2 | 1.563 | 53.069 |
| it is | 327 | 0 | 2 | 3.541 | 51.057 |
| has been | 188 | 0 | 2 | 5.200 | 50.397 |
| have been | 209 | 0 | 2 | 4.758 | 49.254 |
| those who | 130 | 0 | 2 | 5.817 | 45.564 |
| we have | 270 | 0 | 2 | 3.371 | 45.065 |
| united states | 165 | 0 | 2 | 7.935 | 43.578 |
| of our | 635 | 0 | 2 | 2.029 | 41.981 |
| will be | 220 | 0 | 2 | 3.337 | 41.329 |
| in the | 828 | 0 | 2 | 1.709 | 40.143 |
| let us | 101 | 0 | 2 | 6.342 | 37.517 |
| should be | 140 | 0 | 2 | 4.301 | 37.515 |
| we are | 187 | 0 | 2 | 3.127 | 36.252 |
| we will | 202 | 0 | 2 | 2.971 | 36.225 |
| may be | 126 | 0 | 2 | 4.043 | 35.165 |
| fellow citizens | 79 | 0 | 2 | 7.822 | 34.764 |
| i shall | 96 | 0 | 2 | 4.323 | 34.030 |
| we must | 128 | 0 | 2 | 3.688 | 33.173 |
| must be | 117 | 0 | 2 | 3.782 | 33.094 |
| there is | 104 | 0 | 2 | 4.150 | 32.331 |
col3 <- textstat_collocations(toks_raw, size = 3, min_count = 3)
head(col3, 10) |>
knitr::kable(digits = 3,
caption = "Top Trigram Collocations")
| collocation | count | count_nested | length | lambda | z |
|---|---|---|---|---|---|
| of which the | 11 | 0 | 3 | 3.065 | 7.970 |
| all of us | 15 | 0 | 3 | 4.455 | 7.394 |
| in which the | 14 | 0 | 3 | 2.486 | 7.161 |
| than that of | 8 | 0 | 3 | 4.966 | 6.965 |
| is not the | 15 | 0 | 3 | 2.523 | 6.733 |
| is that of | 5 | 0 | 3 | 3.498 | 6.448 |
| as that of | 4 | 0 | 3 | 4.885 | 6.393 |
| the american people | 40 | 0 | 3 | 5.651 | 6.307 |
| of president of | 6 | 0 | 3 | 4.783 | 6.269 |
| to that of | 4 | 0 | 3 | 3.389 | 6.222 |
# 1. Preparation (note: this overwrites the stemmed `toks` object created earlier)
toks <- tokens(data_corpus_inaugural, remove_punct = TRUE) %>%
tokens_tolower() %>%
tokens_remove(stopwords("english"))
# 2. Build the feature co-occurrence matrix (FCM) with a 5-word window
fcmat <- fcm(toks, context = "window", window = 5)
# 3. Keep only the 50 most frequent words for better readability
# Build a DFM and extract its most frequent features
dfmat <- dfm(toks)
feat <- names(topfeatures(dfmat, 50)) # topfeatures() is applied to the DFM here
# Filter the FCM down to the words selected from the DFM
fcm_subset <- fcm_select(fcmat, pattern = feat)
# 4. Draw the network plot
library(ggplot2)
textplot_network(fcm_subset,
min_freq = 0.9,
vertex_labelsize = 5,
vertex_color = "#E41A1C", # reddish nodes
edge_color = "#377EB8", # bluish edges
edge_alpha = 0.4, # faint edges for better readability
vertex_size = colSums(fcm_subset)/max(colSums(fcm_subset)) * 5) +
labs(title = "Word Co-occurrence Network in the Inaugural Speeches",
subtitle = "Based on the 50 most frequent words (stopwords removed)",
caption = "Source: data_corpus_inaugural") +
# hjust is not a labs() argument; centre the title through theme() instead
theme(plot.title = element_text(hjust = 0.5))

textstat_entropy(): Shannon Entropy
Shannon entropy measures how evenly counts are distributed. With margin = "features" it is computed for each term across documents: high entropy means the term is spread evenly over many documents, low entropy means it is concentrated in a few. With margin = "documents" it is computed for each document across its terms.
Shannon entropy measures the uncertainty about the outcome of a random event. Its formula is:
\[H(X) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i)\]
Where:
- \(H(X)\) is the entropy,
- \(P(x_i)\) is the probability of outcome \(x_i\).
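The formula can be checked by hand on a few count vectors. The helper shannon_entropy() below is a hypothetical base R sketch that turns counts into probabilities and applies the formula with a base-2 logarithm; it is not the package's code.

```r
# Shannon entropy of a vector of counts (zero counts contribute nothing)
shannon_entropy <- function(counts, base = 2) {
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log(p, base = base))
}

# Hypothetical counts of one term across four documents
even   <- shannon_entropy(c(5, 5, 5, 5))   # spread evenly
skewed <- shannon_entropy(c(8, 4, 2, 2))   # somewhat concentrated
single <- shannon_entropy(c(20, 0, 0, 0))  # all in one document

print(c(even = even, skewed = skewed, single = single))
```

With four documents the maximum is log2(4) = 2 bits, reached only when the term is used equally often in each; a term confined to a single document has entropy 0.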
ent <- textstat_entropy(dfm_all, margin = "features")
ent_top <- ent |>
arrange(desc(entropy)) |>
head(20)
ggplot(ent_top, aes(x = reorder(feature, entropy), y = entropy)) +
geom_col(fill = "#7b3294", alpha = .8) +
coord_flip() +
labs(title = "Top 20 Features by Entropy",
subtitle = "High entropy = term used across many documents evenly",
x = NULL, y = "Shannon Entropy") +
theme_minimal(base_size = 12)
# Entropy across features within each document
ent_doc <- textstat_entropy(dfm15, margin = "documents")
# Label by "Year President" so presidents with several speeches stay distinct
ent_doc$Speech <- paste(docvars(corp15, "Year"), docvars(corp15, "President"))
ggplot(ent_doc, aes(x = reorder(Speech, entropy), y = entropy)) +
geom_point(colour = "#e66101", size = 4) +
geom_segment(aes(xend = reorder(Speech, entropy), yend = 0),
colour = "#e66101", linewidth = .8) +
coord_flip() +
labs(title = "Document-Level Entropy",
x = NULL, y = "Entropy") +
theme_minimal(base_size = 12)

textstat_summary(): Corpus, Tokens, and DFM Summary
A quick diagnostic function that returns character, sentence, token, and type counts, plus counts of punctuation, numbers, symbols, URLs, tags, and emojis, for each document.
summ_corp <- textstat_summary(corp15)
knitr::kable(summ_corp,
caption = "Corpus Summary: Speeches Since 1965",
digits = 1)
| document | chars | sents | tokens | types | puncts | numbers | symbols | urls | tags | emojis |
|---|---|---|---|---|---|---|---|---|---|---|
| 1965-Johnson | 8205 | 93 | 1710 | 535 | 221 | 3 | 0 | 0 | 0 | 0 |
| 1969-Nixon | 11644 | 103 | 2416 | 714 | 292 | 0 | 0 | 0 | 0 | 0 |
| 1973-Nixon | 10007 | 68 | 1995 | 515 | 193 | 1 | 0 | 0 | 0 | 0 |
| 1977-Carter | 6878 | 52 | 1370 | 501 | 146 | 3 | 0 | 0 | 0 | 0 |
| 1981-Reagan | 13743 | 129 | 2781 | 850 | 349 | 1 | 0 | 0 | 0 | 0 |
| 1985-Reagan | 14572 | 123 | 2909 | 876 | 345 | 11 | 0 | 0 | 0 | 0 |
| 1989-Bush | 12529 | 141 | 2674 | 756 | 357 | 2 | 0 | 0 | 0 | 0 |
| 1993-Clinton | 9113 | 81 | 1833 | 605 | 235 | 0 | 0 | 0 | 0 | 0 |
| 1997-Clinton | 12262 | 111 | 2436 | 726 | 279 | 0 | 0 | 0 | 0 | 0 |
| 2001-Bush | 9054 | 97 | 1806 | 592 | 222 | 1 | 0 | 0 | 0 | 0 |
| 2005-Bush | 11923 | 99 | 2312 | 734 | 241 | 0 | 0 | 0 | 0 | 0 |
| 2009-Obama | 13460 | 110 | 2689 | 900 | 299 | 0 | 0 | 0 | 0 | 0 |
| 2013-Obama | 11917 | 88 | 2317 | 786 | 220 | 5 | 0 | 0 | 0 | 0 |
| 2017-Trump | 8433 | 88 | 1660 | 547 | 215 | 2 | 0 | 0 | 0 | 0 |
| 2021-Biden | 13133 | 216 | 2766 | 744 | 394 | 6 | 0 | 0 | 0 | 0 |
| 2025-Trump | 17077 | 177 | 3347 | 950 | 434 | 4 | 0 | 0 | 0 | 0 |
summ_tok <- textstat_summary(toks15)
knitr::kable(summ_tok,
caption = "Tokens Summary (after preprocessing)",
digits = 1)
| document | chars | sents | tokens | types | puncts | numbers | symbols | urls | tags | emojis |
|---|---|---|---|---|---|---|---|---|---|---|
| 1965-Johnson | NA | NA | 691 | 396 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1969-Nixon | NA | NA | 1028 | 544 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1973-Nixon | NA | NA | 851 | 362 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1977-Carter | NA | NA | 592 | 383 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1981-Reagan | NA | NA | 1146 | 650 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1985-Reagan | NA | NA | 1291 | 661 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1989-Bush | NA | NA | 1092 | 564 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1993-Clinton | NA | NA | 798 | 466 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1997-Clinton | NA | NA | 1130 | 576 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2001-Bush | NA | NA | 783 | 451 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2005-Bush | NA | NA | 1041 | 551 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2009-Obama | NA | NA | 1173 | 711 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2013-Obama | NA | NA | 1031 | 614 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2017-Trump | NA | NA | 713 | 416 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2021-Biden | NA | NA | 1127 | 565 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2025-Trump | NA | NA | 1448 | 753 | 0 | 0 | 0 | 0 | 0 | 0 |
| Function | Input | What it computes |
|---|---|---|
| textstat_frequency() | DFM | Term frequencies & document frequencies |
| textstat_lexdiv() | Tokens | Lexical diversity (TTR, MATTR, MSTTR, …) |
| textstat_readability() | Corpus | Readability indices (Flesch, FOG, SMOG, …) |
| textstat_dist() | DFM | Pairwise document/feature distances |
| textstat_simil() | DFM | Pairwise document/feature similarities |
| textstat_keyness() | DFM | Keyness of terms in target vs reference |
| textstat_collocations() | Tokens | Multi-word collocations (λ, z-scores) |
| textstat_entropy() | DFM | Shannon entropy per doc or feature |
| textstat_summary() | Corpus/Tokens/DFM | Token, type & sentence counts per doc |