quanteda.textstats is a companion package to quanteda that provides statistical analysis functions for text objects (corpora, tokens, and document-feature matrices). This document walks through each major function with reproducible examples using the built-in data_corpus_inaugural dataset.

1 Setup & Data Preparation

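# Install packages if needed (tidyr, tibble, knitr and ggrepel are called
# with the :: prefix later on, so they must be installed too):
# install.packages(c("quanteda", "quanteda.textstats",
#                    "quanteda.textplots", "ggplot2", "dplyr",
#                    "tidyr", "tibble", "knitr", "ggrepel"))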

library(quanteda)
library(quanteda.textstats)
library(quanteda.textplots)
library(ggplot2)
library(dplyr)

# ── Build a DFM from US inaugural speeches ──────────────────────────────────
corp <- data_corpus_inaugural          # 60 US inaugural addresses (see ndoc() output below)

toks <- corp |>
  tokens(remove_punct = TRUE,
         remove_symbols = TRUE,
         remove_numbers = TRUE) |>
  tokens_remove(pattern = stopwords("en")) |>
  tokens_wordstem()

dfm_all <- dfm(toks)

# Subset: speeches since 1965 (16 documents) for compact comparisons
corp15   <- corpus_subset(corp, Year >= 1965)
toks15   <- tokens(corp15,
                   remove_punct  = TRUE,
                   remove_symbols = TRUE,
                   remove_numbers = TRUE) |>
            tokens_remove(stopwords("en")) |>
            tokens_wordstem()
dfm15    <- dfm(toks15)

cat("Full DFM:", ndoc(dfm_all), "docs ×", nfeat(dfm_all), "features\n")

## Full DFM: 60 docs × 5540 features

cat("Subset DFM:", ndoc(dfm15),  "docs ×", nfeat(dfm15),  "features\n")

## Subset DFM: 16 docs × 2822 features

2 `textstat_frequency()`: Term Frequency

textstat_frequency() returns term frequencies across the whole corpus or within groups, making it easy to find the most common vocabulary.
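To make the output columns concrete, the following base-R sketch rebuilds frequency, docfreq and rank by hand on a tiny count matrix. The matrix and its three stems are made up for illustration, and the "min" tie handling is an assumption chosen because tied terms in the table further down share a rank.

```r
# Hypothetical 3-document count matrix (rows = documents, columns = stems)
counts <- matrix(c(2, 0, 1,
                   1, 1, 0,
                   3, 0, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2", "doc3"),
                                 c("nation", "peopl", "govern")))

freq_tbl <- data.frame(
  feature   = colnames(counts),
  frequency = unname(colSums(counts)),      # total occurrences in the corpus
  docfreq   = unname(colSums(counts > 0))   # number of documents containing the stem
)

# Sort by descending frequency; tied stems share the lowest rank
freq_tbl <- freq_tbl[order(-freq_tbl$frequency), ]
freq_tbl$rank <- rank(-freq_tbl$frequency, ties.method = "min")
print(freq_tbl)
```

Here "nation" occurs six times in all three documents, while the two remaining stems tie on one occurrence each and therefore share rank 2.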

# Top 20 terms overall
freq <- textstat_frequency(dfm_all, n = 20)

knitr::kable(freq[, c("feature","frequency","rank","docfreq")],
             caption = "Top 20 Stems — All Inaugural Speeches",
             align   = "lrrr")   # one alignment letter per displayed column

Top 20 Stems — All Inaugural Speeches
feature	frequency	rank	docfreq
nation	713	1	59
govern	666	2	55
peopl	640	3	58
us	507	4	57
can	489	5	57
state	463	6	49
great	389	7	57
power	384	8	54
must	377	9	53
countri	376	10	57
upon	371	11	47
world	357	12	55
may	343	13	54
shall	316	14	51
everi	309	15	53
constitut	291	16	42
peac	288	17	51
one	286	18	53
right	286	18	56
american	277	20	49

freq20 <- textstat_frequency(dfm_all, n = 20)

ggplot(freq20, aes(x = reorder(feature, frequency), y = frequency)) +
  geom_col(fill = "#2c7bb6", alpha = .85) +
  coord_flip() +
  labs(title    = "Top 20 Terms — US Inaugural Addresses",
       subtitle = "After stopword removal and stemming",
       x = NULL, y = "Total Frequency") +
  theme_minimal(base_size = 13)

# Frequency within groups. The "20th" label covers every speech before 2000,
# including the 18th- and 19th-century addresses
docvars(dfm_all, "century") <- ifelse(docvars(dfm_all, "Year") >= 2000,
                                       "21st", "20th")
freq_grp <- textstat_frequency(dfm_all, n = 10, groups = century)

ggplot(freq_grp, aes(x = reorder(feature, frequency), y = frequency,
                     fill = group)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~group, scales = "free_y") +
  coord_flip() +
  scale_fill_manual(values = c("20th" = "#d7191c", "21st" = "#1a9641")) +
  labs(title = "Top Terms by Century", x = NULL, y = "Frequency") +
  theme_minimal(base_size = 12)

3 `textstat_lexdiv()`: Lexical Diversity

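Measures how rich and varied the vocabulary of each document is. Several indices are available, including TTR, MATTR and MSTTR. MTLD is requested in the code below but does not appear in the output, so it is apparently not returned by the version used here.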

TTR (Type-Token Ratio)

Shows the share of unique words (types) among all words (tokens) in a text. Drawback: it is very sensitive to text length, so it cannot be used directly to compare texts of different lengths.

MATTR (Moving-Average Type-Token Ratio)

A more recent measure that is less sensitive to text length. Advantage: it is suitable for comparing the vocabulary of texts of different lengths without the length distorting the result.
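The length effect can be seen in a small base-R sketch. The helper names ttr() and mattr() are hypothetical, the window of 100 tokens is an assumed default, and the two texts are artificial: both cycle through the same 50 word types, one for 100 tokens and one for 1,000.

```r
# Type-token ratio: unique tokens divided by all tokens
ttr <- function(tokens) length(unique(tokens)) / length(tokens)

# Moving-average TTR: mean TTR over every window of `window` consecutive tokens
mattr <- function(tokens, window = 100L) {
  n <- length(tokens)
  if (n <= window) return(ttr(tokens))   # text no longer than one window
  starts <- seq_len(n - window + 1L)
  mean(vapply(starts,
              function(i) ttr(tokens[i:(i + window - 1L)]),
              numeric(1)))
}

vocab      <- paste0("w", 1:50)    # 50 artificial word types
short_text <- rep(vocab, 2)        # 100 tokens
long_text  <- rep(vocab, 20)       # 1000 tokens, same vocabulary

c(short = ttr(short_text),   long = ttr(long_text))    # TTR drops with length
c(short = mattr(short_text), long = mattr(long_text))  # MATTR stays the same
```

Both texts use an identical vocabulary, yet plain TTR falls from 0.5 to 0.05 as the text grows, while the windowed average stays at 0.5.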

ld <- textstat_lexdiv(toks15,
                      measure = c("TTR", "MATTR", "MTLD"))

# Attach metadata and drop the 'document' column for display
ld$President <- docvars(corp15, "President")
ld$Year      <- docvars(corp15, "Year")

# Select only columns that actually exist
show_cols <- intersect(c("Year", "President", "TTR", "MATTR", "MTLD"), names(ld))

knitr::kable(ld[, show_cols],
             digits  = 3,
             caption = "Lexical Diversity — Speeches Since 1965")

Lexical Diversity — Speeches Since 1965
Year	President	TTR	MATTR
1965	Johnson	0.573	0.828
1969	Nixon	0.529	0.805
1973	Nixon	0.425	0.701
1977	Carter	0.647	0.847
1981	Reagan	0.567	0.839
1985	Reagan	0.512	0.828
1989	Bush	0.516	0.820
1993	Clinton	0.584	0.832
1997	Clinton	0.510	0.812
2001	Bush	0.576	0.855
2005	Bush	0.529	0.834
2009	Obama	0.606	0.894
2013	Obama	0.596	0.880
2017	Trump	0.583	0.807
2021	Biden	0.501	0.827
2025	Trump	0.520	0.857

# Pivot only the measure columns that exist
measure_cols <- intersect(c("TTR", "MATTR", "MTLD"), names(ld))

ld_long <- tidyr::pivot_longer(ld,
                                cols      = all_of(measure_cols),
                                names_to  = "Measure",
                                values_to = "Score")

ggplot(ld_long, aes(x = factor(Year), y = Score,
                     group = Measure, colour = Measure)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2.5) +
  facet_wrap(~Measure, scales = "free_y", ncol = 1) +
  labs(title = "Lexical Diversity Over Time",
       x = "Year", y = "Score") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))

4 `textstat_readability()`: Readability Scores

Quantifies how easy or difficult each speech is to read using classic formulas such as Flesch, Flesch-Kincaid, Gunning Fog, etc.
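As a rough illustration of what two of these indices compute, the sketch below applies the standard published Flesch Reading Ease and Flesch-Kincaid grade formulas to raw counts. The function names and the word, sentence and syllable counts are hypothetical; the package derives its own counts from the text, so its scores will not be reproduced exactly this way.

```r
# Flesch Reading Ease: higher values mean easier text
flesch_reading_ease <- function(words, sentences, syllables) {
  206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
}

# Flesch-Kincaid grade level: approximate US school grade
flesch_kincaid_grade <- function(words, sentences, syllables) {
  0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
}

# Hypothetical passage: 100 words, 5 sentences, 150 syllables
fre <- flesch_reading_ease(words = 100, sentences = 5, syllables = 150)
fk  <- flesch_kincaid_grade(words = 100, sentences = 5, syllables = 150)
round(c(Flesch = fre, Flesch.Kincaid = fk), 2)
```

Both formulas depend only on average sentence length and average syllables per word, which is why the two columns in the table below move in opposite directions.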

# Works on a corpus (needs sentence structure)
corp15_nostop <- corpus_subset(data_corpus_inaugural, Year >= 1965)

rd <- textstat_readability(corp15_nostop,
                           measure = c("Flesch",
                                       "Flesch.Kincaid",
                                       "FOG",
                                       "SMOG"))
rd$President <- docvars(corp15_nostop, "President")
rd$Year      <- docvars(corp15_nostop, "Year")

knitr::kable(rd[, c("Year","President","Flesch","Flesch.Kincaid","FOG","SMOG")],
             digits  = 2,
             caption = "Readability Scores — Speeches Since 1965")

Readability Scores — Speeches Since 1965
Year	President	Flesch	Flesch.Kincaid	FOG	SMOG
1965	Johnson	69.41	7.56	10.41	10.36
1969	Nixon	65.58	9.24	12.05	11.13
1973	Nixon	54.19	12.30	15.20	13.10
1977	Carter	53.38	11.67	14.55	13.06
1981	Reagan	58.75	9.76	12.92	12.22
1985	Reagan	57.58	10.42	13.48	12.47
1989	Bush	73.10	7.15	9.98	9.88
1993	Clinton	55.81	10.38	13.20	12.37
1997	Clinton	59.22	9.83	12.69	11.96
2001	Bush	60.12	8.93	11.63	11.37
2005	Bush	53.19	11.04	14.11	13.02
2009	Obama	60.53	10.23	12.71	11.55
2013	Obama	53.56	11.73	14.51	12.95
2017	Trump	58.58	9.17	12.16	11.78
2021	Biden	73.20	5.78	8.74	9.37
2025	Trump	55.08	9.67	12.64	12.15

ggplot(rd, aes(x = Year, y = Flesch, label = President)) +
  geom_smooth(method = "loess", se = TRUE,
              colour = "#f46d43", fill = "#fee090", linewidth = 1) +
  geom_point(colour = "#4393c3", size = 3) +
  ggrepel::geom_text_repel(size = 3, max.overlaps = 6) +
  labs(title    = "Flesch Reading Ease Over Time",
       subtitle = "Higher = easier to read",
       x = "Year", y = "Flesch Score") +
  theme_minimal(base_size = 13)

5 `textstat_dist()`: Document Distance

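Computes pairwise distances between documents (or features) in the DFM. Supported methods include Euclidean, Manhattan, maximum, Canberra and Minkowski; cosine is a similarity measure and belongs to textstat_simil().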
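A minimal base-R sketch of the same idea, using stats::dist() on a made-up count matrix rather than a real DFM. It also shows one property worth keeping in mind: raw-count distances grow with document length, so a document and its doubled copy are not at distance zero.

```r
# Hypothetical counts for three documents over three features
counts <- rbind(doc_a = c(2, 1, 0),
                doc_b = c(4, 2, 0),   # doc_a repeated twice
                doc_c = c(0, 1, 3))

euclid <- as.matrix(dist(counts, method = "euclidean"))
manhat <- as.matrix(dist(counts, method = "manhattan"))

round(euclid, 3)
manhat
```

If length differences are not of interest, one option is to convert counts to proportions first, for example with dfm_weight(scheme = "prop"), before computing distances.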

dist_mat <- textstat_dist(dfm15, method = "euclidean")

# Build unique labels: "Year President" to avoid duplicate name clashes
labels15 <- paste(docvars(corp15, "Year"), docvars(corp15, "President"))

dist_df <- as.matrix(dist_mat)
rownames(dist_df) <- labels15
colnames(dist_df) <- labels15

# Heatmap via ggplot
dist_long <- dist_df |>
  as.data.frame() |>
  tibble::rownames_to_column("Doc1") |>
  tidyr::pivot_longer(-Doc1, names_to = "Doc2", values_to = "Distance")

ggplot(dist_long, aes(x = Doc1, y = Doc2, fill = Distance)) +
  geom_tile() +
  scale_fill_distiller(palette = "RdYlBu", direction = -1) +
  labs(title = "Euclidean Distance Between Speeches",
       x = NULL, y = NULL) +
  theme_minimal(base_size = 10) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

6 `textstat_simil()`: Document Similarity

The counterpart to textstat_dist(): higher values mean more similar documents. Cosine similarity is the most popular choice for text.
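Cosine similarity can be written out in a few lines of base R. The helper cosine_sim() and the three count vectors are hypothetical; the point is that the measure depends only on the direction of the count vectors, not on document length.

```r
# Cosine similarity: dot product divided by the product of vector lengths
cosine_sim <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

doc_a <- c(2, 1, 0)
doc_b <- c(4, 2, 0)   # doc_a repeated twice: same proportions, double length
doc_c <- c(0, 1, 3)   # mostly different vocabulary

sim_ab <- cosine_sim(doc_a, doc_b)
sim_ac <- cosine_sim(doc_a, doc_c)
round(c(a_vs_b = sim_ab, a_vs_c = sim_ac), 3)
```

The doubled document scores a similarity of 1 with the original, whereas the document with mostly different vocabulary scores close to 0.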

sim_mat <- textstat_simil(dfm15, method = "cosine")

# Same unique labels as above
sim_df  <- as.matrix(sim_mat)
rownames(sim_df) <- labels15
colnames(sim_df) <- labels15

sim_long <- sim_df |>
  as.data.frame() |>
  tibble::rownames_to_column("Doc1") |>
  tidyr::pivot_longer(-Doc1, names_to = "Doc2", values_to = "Similarity")

ggplot(sim_long, aes(x = Doc1, y = Doc2, fill = Similarity)) +
  geom_tile() +
  scale_fill_distiller(palette = "YlGn", direction = 1) +
  labs(title = "Cosine Similarity Between Speeches",
       x = NULL, y = NULL) +
  theme_minimal(base_size = 10) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

7 `textstat_keyness()`: Keyness Analysis

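Identifies terms that are significantly more (or less) frequent in a target set than in a reference set. The default measure is chi-squared; a likelihood-ratio statistic, Fisher's exact test and pointwise mutual information are also available through the measure argument.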
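For a single term, the chi-squared version can be sketched with a 2×2 table and base R's chisq.test(). This is an approximation under stated assumptions: a Yates-corrected Pearson statistic, with a negative sign attached when the term is under-represented in the target. The helper signed_chi2() and all counts are hypothetical.

```r
# Signed chi-squared keyness for one term from four counts
signed_chi2 <- function(n_target, n_reference, total_target, total_reference) {
  tab <- matrix(c(n_target,    total_target    - n_target,
                  n_reference, total_reference - n_reference),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("target", "reference"),
                                c("term", "other")))
  test <- chisq.test(tab, correct = TRUE)        # Yates continuity correction
  direction <- if (n_target >= test$expected["target", "term"]) 1 else -1
  direction * unname(test$statistic)
}

# Term seen 30 times in 1000 target tokens and 10 times in 2000 reference tokens
key_over  <- signed_chi2(30, 10, 1000, 2000)
# Same term with the roles of the two groups swapped
key_under <- signed_chi2(10, 30, 2000, 1000)
c(over = key_over, under = key_under)
```

Swapping target and reference leaves the magnitude unchanged and flips the sign, which is how the same table can list both characteristic and uncharacteristic terms.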

# Target: 21st-century speeches; Reference: all earlier speeches (labelled "20th")
dfm_cent <- dfm_group(dfm_all, groups = century)

key <- textstat_keyness(dfm_cent, target = "21st")

head(key, 15) |>
  knitr::kable(digits  = 3,
               caption = "Keyness: 21st-Century vs 20th-Century Speeches")

Keyness: 21st-Century vs 20th-Century Speeches
feature	chi2	n_target	n_reference
america	245.853	107	153
thank	163.363	34	17
american	120.372	87	190
stori	108.238	19	5
job	73.102	15	6
today	65.895	41	80
day	45.848	37	87
border	45.278	11	6
soul	38.496	11	8
generat	37.875	25	51
back	37.771	18	26
storm	37.154	8	3
ideal	36.481	19	32
worker	34.837	7	2
freedom	33.312	46	147

# quanteda.textplots provides a dedicated keyness plot
if (requireNamespace("quanteda.textplots", quietly = TRUE)) {
  quanteda.textplots::textplot_keyness(key,
                                       n      = 15,
                                       labelcolor = "grey30") +
    labs(title = "Keyness Plot — 21st vs 20th Century") +
    theme_minimal(base_size = 12)
} else {
  # Fallback ggplot version
  key_top <- rbind(head(key, 10), tail(key, 10))
  key_top$direction <- ifelse(key_top$chi2 > 0, "21st century", "20th century")

  ggplot(key_top, aes(x = reorder(feature, chi2), y = chi2, fill = direction)) +
    geom_col() +
    coord_flip() +
    scale_fill_manual(values = c("21st century" = "#1a9641",
                                 "20th century" = "#d7191c")) +
    labs(title = "Keyness: Top Distinctive Terms",
         x = NULL, y = "Chi-squared statistic", fill = NULL) +
    theme_minimal(base_size = 12)
}

8 `textstat_collocations()`: Collocations

Finds multi-word expressions that appear together more often than chance. Useful for discovering idioms, named entities, and technical phrases.
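To give a feel for the λ and z columns reported below, here is a base-R sketch for the bigram case only. It assumes that λ reduces to a smoothed log odds ratio of the 2×2 table of adjacent-word counts (0.5 added to each cell) and that z is λ divided by its approximate standard error. The helper bigram_lambda() and the token sequence are made up, and the package's own computation may differ in detail.

```r
# Hypothetical token sequence
tokens <- c("united", "states", "of", "america", "the", "united", "states",
            "and", "the", "people", "of", "the", "united", "nations")

first  <- head(tokens, -1)   # first word of every adjacent pair
second <- tail(tokens, -1)   # second word of every adjacent pair

bigram_lambda <- function(w1, w2, first, second, smoothing = 0.5) {
  n11 <- sum(first == w1 & second == w2)   # w1 followed by w2
  n10 <- sum(first == w1 & second != w2)   # w1 followed by something else
  n01 <- sum(first != w1 & second == w2)   # something else followed by w2
  n00 <- sum(first != w1 & second != w2)   # neither
  n <- c(n11, n10, n01, n00) + smoothing
  lambda <- log(n[1]) + log(n[4]) - log(n[2]) - log(n[3])
  z <- lambda / sqrt(sum(1 / n))           # Wald-type z statistic
  c(count = n11, lambda = lambda, z = z)
}

res_us <- bigram_lambda("united", "states", first, second)
res_of <- bigram_lambda("of", "the", first, second)
round(rbind(`united states` = res_us, `of the` = res_of), 3)
```

Even in this toy sequence the fixed phrase gets a higher λ than the function-word pair, because "the" also follows other words while "states" only ever follows "united".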

# Run on the tokens object (before stopword removal for natural phrases)
toks_raw <- tokens(corp, remove_punct = TRUE)

col <- textstat_collocations(toks_raw,
                             size    = 2,        # bigrams
                             min_count = 5)      # at least 5 occurrences

head(col, 20) |>
  knitr::kable(digits  = 3,
               caption = "Top Bigram Collocations (λ statistic)")

Top Bigram Collocations (λ statistic)
collocation	count	length	lambda	z
of the	1786	2	1.563	53.069
it is	327	2	3.541	51.057
has been	188	2	5.200	50.397
have been	209	2	4.758	49.254
those who	130	2	5.817	45.564
we have	270	2	3.371	45.065
united states	165	2	7.935	43.578
of our	635	2	2.029	41.981
will be	220	2	3.337	41.329
in the	828	2	1.709	40.143
let us	101	2	6.342	37.517
should be	140	2	4.301	37.515
we are	187	2	3.127	36.252
we will	202	2	2.971	36.225
may be	126	2	4.043	35.165
fellow citizens	79	2	7.822	34.764
i shall	96	2	4.323	34.030
we must	128	2	3.688	33.173
must be	117	2	3.782	33.094
there is	104	2	4.150	32.331

col3 <- textstat_collocations(toks_raw, size = 3, min_count = 3)
head(col3, 10) |>
  knitr::kable(digits  = 3,
               caption = "Top Trigram Collocations")

Top Trigram Collocations
collocation	count	length	lambda	z
of which the	11	3	3.065	7.970
all of us	15	3	4.455	7.394
in which the	14	3	2.486	7.161
than that of	8	3	4.966	6.965
is not the	15	3	2.523	6.733
is that of	5	3	3.498	6.448
as that of	4	3	4.885	6.393
the american people	40	3	5.651	6.307
of president of	6	3	4.783	6.269
to that of	4	3	3.389	6.222

# Related view: a feature co-occurrence network drawn with textplot_network()
# 1. Preparation (separate object so the stemmed `toks` above is not overwritten)
toks_net <- tokens(data_corpus_inaugural, remove_punct = TRUE) |>
  tokens_tolower() |>
  tokens_remove(stopwords("english"))

# 2. Build the feature co-occurrence matrix (FCM) with a 5-word window
fcmat <- fcm(toks_net, context = "window", window = 5)

# 3. Keep only the 50 most frequent words for readability:
#    build a DFM, take its top features, then filter the FCM to those words
dfmat <- dfm(toks_net)
feat <- names(topfeatures(dfmat, 50))   # topfeatures() works on a DFM
fcm_subset <- fcm_select(fcmat, pattern = feat)

# 4. Draw the network plot
textplot_network(fcm_subset,
                 min_freq = 0.9,
                 vertex_labelsize = 5,
                 vertex_color = "#E41A1C",    # reddish nodes
                 edge_color = "#377EB8",      # bluish edges
                 edge_alpha = 0.4,            # faint edges for readability
                 vertex_size = colSums(fcm_subset) / max(colSums(fcm_subset)) * 5) +
  labs(title    = "Word Co-occurrence Network in Presidential Speeches",
       subtitle = "Based on the 50 most frequent words (stopwords removed)",
       caption  = "Source: data_corpus_inaugural") +
  theme(plot.title = element_text(hjust = 0.5))   # centre the title

9 `textstat_entropy()`: Shannon Entropy

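Shannon entropy measures how evenly counts are spread. With margin = "features" it describes each term's spread across documents: high entropy means the term occurs evenly in many documents, low entropy means it is concentrated in a few. With margin = "documents" it describes how evenly each document's tokens are spread over its vocabulary.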

Shannon entropy measures the uncertainty about the outcome of a random event. Its formula is:

\[H(X) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i)\]

where H(X) is the entropy and P(x_i) is the probability of outcome x_i.
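The formula above can be checked by hand with a short base-R function. The helper shannon_entropy() and the count vectors are hypothetical; counts are turned into probabilities, and zero counts are dropped so that 0 · log(0) is treated as 0.

```r
# Shannon entropy (in bits by default) of a vector of counts
shannon_entropy <- function(counts, base = 2) {
  p <- counts / sum(counts)   # relative frequencies P(x_i)
  p <- p[p > 0]               # skip empty cells
  -sum(p * log(p, base = base))
}

h_even   <- shannon_entropy(c(5, 5, 5, 5))    # term spread evenly over 4 documents
h_single <- shannon_entropy(c(20, 0, 0, 0))   # term confined to one document
h_skewed <- shannon_entropy(c(17, 1, 1, 1))   # mostly in one document
c(even = h_even, single = h_single, skewed = h_skewed)
```

An even spread over four documents gives the maximum of log2(4) = 2 bits, a term confined to one document gives 0, and skewed distributions fall in between.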

ent <- textstat_entropy(dfm_all, margin = "features")

ent_top <- ent |>
  arrange(desc(entropy)) |>
  head(20)

ggplot(ent_top, aes(x = reorder(feature, entropy), y = entropy)) +
  geom_col(fill = "#7b3294", alpha = .8) +
  coord_flip() +
  labs(title    = "Top 20 Features by Entropy",
       subtitle = "High entropy = term used across many documents evenly",
       x = NULL, y = "Shannon Entropy") +
  theme_minimal(base_size = 12)

# Entropy across features within each document
ent_doc <- textstat_entropy(dfm15, margin = "documents")
# Label by year and president: several presidents gave two speeches, so the
# name alone would collapse them onto a single row
ent_doc$Speech <- paste(docvars(corp15, "Year"), docvars(corp15, "President"))

ggplot(ent_doc, aes(x = reorder(Speech, entropy), y = entropy)) +
  geom_point(colour = "#e66101", size = 4) +
  geom_segment(aes(xend = reorder(Speech, entropy), yend = 0),
               colour = "#e66101", linewidth = .8) +
  coord_flip() +
  labs(title = "Document-Level Entropy",
       x = NULL, y = "Entropy") +
  theme_minimal(base_size = 12)

10 `textstat_summary()`: Corpus, Tokens and DFM Summary

A quick diagnostic function returning token counts, type counts, sentences, and other metadata for each document.

summ_corp <- textstat_summary(corp15)
knitr::kable(summ_corp,
             caption = "Corpus Summary — Speeches Since 1965",
             digits  = 1)

Corpus Summary — Speeches Since 1965
document	chars	sents	tokens	types	puncts	numbers
1965-Johnson	8205	93	1710	535	221	3
1969-Nixon	11644	103	2416	714	292	0
1973-Nixon	10007	68	1995	515	193	1
1977-Carter	6878	52	1370	501	146	3
1981-Reagan	13743	129	2781	850	349	1
1985-Reagan	14572	123	2909	876	345	11
1989-Bush	12529	141	2674	756	357	2
1993-Clinton	9113	81	1833	605	235	0
1997-Clinton	12262	111	2436	726	279	0
2001-Bush	9054	97	1806	592	222	1
2005-Bush	11923	99	2312	734	241	0
2009-Obama	13460	110	2689	900	299	0
2013-Obama	11917	88	2317	786	220	5
2017-Trump	8433	88	1660	547	215	2
2021-Biden	13133	216	2766	744	394	6
2025-Trump	17077	177	3347	950	434	4

summ_tok <- textstat_summary(toks15)
knitr::kable(summ_tok,
             caption = "Tokens Summary (after preprocessing)",
             digits  = 1)

Tokens Summary (after preprocessing)
document	chars	sents	tokens	types
1965-Johnson	NA	NA	691	396
1969-Nixon	NA	NA	1028	544
1973-Nixon	NA	NA	851	362
1977-Carter	NA	NA	592	383
1981-Reagan	NA	NA	1146	650
1985-Reagan	NA	NA	1291	661
1989-Bush	NA	NA	1092	564
1993-Clinton	NA	NA	798	466
1997-Clinton	NA	NA	1130	576
2001-Bush	NA	NA	783	451
2005-Bush	NA	NA	1041	551
2009-Obama	NA	NA	1173	711
2013-Obama	NA	NA	1031	614
2017-Trump	NA	NA	713	416
2021-Biden	NA	NA	1127	565
2025-Trump	NA	NA	1448	753

11 Quick-Reference Cheat Sheet

Function	Input	What it computes
`textstat_frequency()`	DFM	Term frequencies & document frequencies
`textstat_lexdiv()`	Tokens	Lexical diversity (TTR, MATTR, MSTTR, …)
`textstat_readability()`	Corpus	Readability indices (Flesch, FOG, SMOG, …)
`textstat_dist()`	DFM	Pairwise document/feature distances
`textstat_simil()`	DFM	Pairwise document/feature similarities
`textstat_keyness()`	DFM	Keyness of terms in target vs reference
`textstat_collocations()`	Tokens	Multi-word collocations (λ, z-scores)
`textstat_entropy()`	DFM	Shannon entropy per doc or feature
`textstat_summary()`	Corpus/Tokens/DFM	Token, type & sentence counts per doc

quanteda.textstats — Feature Showcase

Generated Demo (WR)

2026-04-11
