quanteda.textstats is a companion package to quanteda that provides statistical analysis functions for text objects (corpora, tokens, and document-feature matrices). This document walks through each major function with reproducible examples using the built-in data_corpus_inaugural dataset.


1 Setup & Data Preparation

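# Install packages if needed (knitr, tidyr, tibble, and ggrepel are
# called below via `::`):
# install.packages(c("quanteda", "quanteda.textstats",
#                    "quanteda.textplots", "ggplot2", "dplyr",
#                    "knitr", "tidyr", "tibble", "ggrepel"))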

library(quanteda)
library(quanteda.textstats)
library(quanteda.textplots)
library(ggplot2)
library(dplyr)

# ── Build a DFM from US inaugural speeches ──────────────────────────────────
corp <- data_corpus_inaugural          # 60 US inaugural addresses in this run (1789-2025)

toks <- corp |>
  tokens(remove_punct = TRUE,
         remove_symbols = TRUE,
         remove_numbers = TRUE) |>
  tokens_remove(pattern = stopwords("en")) |>
  tokens_wordstem()

dfm_all <- dfm(toks)

# Subset: the 16 speeches since 1965, for compact comparisons
# (object names keep the "15" suffix)
corp15   <- corpus_subset(corp, Year >= 1965)
toks15   <- tokens(corp15,
                   remove_punct  = TRUE,
                   remove_symbols = TRUE,
                   remove_numbers = TRUE) |>
            tokens_remove(stopwords("en")) |>
            tokens_wordstem()
dfm15    <- dfm(toks15)

cat("Full DFM:", ndoc(dfm_all), "docs ×", nfeat(dfm_all), "features\n")
## Full DFM: 60 docs × 5540 features
cat("Subset DFM:", ndoc(dfm15),  "docs ×", nfeat(dfm15),  "features\n")
## Subset DFM: 16 docs × 2822 features

2 textstat_frequency(): Term Frequency

textstat_frequency() returns term frequencies across the whole corpus or within groups, making it easy to find the most common vocabulary.
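As a rough illustration of what the frequency, rank, and docfreq columns mean, the base-R sketch below rebuilds them by hand from a small count matrix. The matrix and its values are hypothetical; only the column definitions mirror the function's output, and the "min" tie rule is an assumption chosen because it reproduces the tied ranks (18, 18, 20) in the table further down.

```r
# Hypothetical toy counts: 3 documents (rows) x 3 stemmed features (columns)
counts <- rbind(doc1 = c(nation = 2, peopl = 0, govern = 1),
                doc2 = c(nation = 1, peopl = 3, govern = 0),
                doc3 = c(nation = 4, peopl = 0, govern = 0))

frequency <- colSums(counts)       # total occurrences of each feature
docfreq   <- colSums(counts > 0)   # number of documents containing the feature
rnk       <- rank(-frequency, ties.method = "min")  # 1 = most frequent

toy_freq <- data.frame(feature   = names(frequency),
                       frequency = frequency,
                       rank      = rnk,
                       docfreq   = docfreq,
                       row.names = NULL)
toy_freq[order(toy_freq$rank), ]
```

Here "nation" occurs 7 times in all 3 documents, while "peopl" occurs 3 times but in only 1 document, which is the distinction between the frequency and docfreq columns.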

# Top 20 terms overall
freq <- textstat_frequency(dfm_all, n = 20)

knitr::kable(freq[, c("feature","frequency","rank","docfreq")],
             caption = "Top 20 Stems — All Inaugural Speeches",
             align   = "lrrr")   # one alignment code per displayed column
Top 20 Stems — All Inaugural Speeches
feature frequency rank docfreq
nation 713 1 59
govern 666 2 55
peopl 640 3 58
us 507 4 57
can 489 5 57
state 463 6 49
great 389 7 57
power 384 8 54
must 377 9 53
countri 376 10 57
upon 371 11 47
world 357 12 55
may 343 13 54
shall 316 14 51
everi 309 15 53
constitut 291 16 42
peac 288 17 51
one 286 18 53
right 286 18 56
american 277 20 49
freq20 <- textstat_frequency(dfm_all, n = 20)

ggplot(freq20, aes(x = reorder(feature, frequency), y = frequency)) +
  geom_col(fill = "#2c7bb6", alpha = .85) +
  coord_flip() +
  labs(title    = "Top 20 Terms — US Inaugural Addresses",
       subtitle = "After stopword removal and stemming",
       x = NULL, y = "Total Frequency") +
  theme_minimal(base_size = 13)

# Frequency within groups. Note: the "20th" label covers every speech
# before 2000 (1789-1997), not only the 20th century.
docvars(dfm_all, "century") <- ifelse(docvars(dfm_all, "Year") >= 2000,
                                       "21st", "20th")
freq_grp <- textstat_frequency(dfm_all, n = 10, groups = century)

ggplot(freq_grp, aes(x = reorder(feature, frequency), y = frequency,
                     fill = group)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~group, scales = "free_y") +
  coord_flip() +
  scale_fill_manual(values = c("20th" = "#d7191c", "21st" = "#1a9641")) +
  labs(title = "Top Terms by Century", x = NULL, y = "Frequency") +
  theme_minimal(base_size = 12)


3 textstat_lexdiv(): Lexical Diversity

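Measures how rich and varied each document's vocabulary is. Several indices are available, including TTR, CTTR, MATTR, and MSTTR. MTLD does not appear in the output below because textstat_lexdiv(), in the version used here, does not implement it.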

TTR (Type-Token Ratio)

The share of unique words (types) among all words (tokens) in a text. Drawback: it is very sensitive to text length, so it cannot be used directly to compare texts of different lengths.

MATTR (Moving-Average Type-Token Ratio)

A more recent measure that is less sensitive to text length: TTR is computed inside a fixed-size window that slides over the text, and the window values are averaged. Advantage: it allows the vocabulary of texts of different lengths to be compared without length distorting the result.
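The two definitions can be made concrete with a few lines of base R. This is only a sketch on a hypothetical eight-token vector with a window of 4; textstat_lexdiv() uses its own implementation and a larger default window, so the helper `mattr()` below is illustrative and not the package's code.

```r
# Hypothetical toy text, already tokenized
tokens_toy <- c("the", "cat", "sat", "on", "the", "mat", "the", "cat")

# TTR: unique types divided by total tokens
ttr <- length(unique(tokens_toy)) / length(tokens_toy)

# MATTR: average the TTR over every window of `window` consecutive tokens
mattr <- function(x, window) {
  stopifnot(window <= length(x))
  starts <- seq_len(length(x) - window + 1)
  window_ttr <- vapply(starts, function(i) {
    length(unique(x[i:(i + window - 1)])) / window
  }, numeric(1))
  mean(window_ttr)
}

mattr_toy <- mattr(tokens_toy, window = 4)

ttr        # 5 types / 8 tokens = 0.625
mattr_toy  # mean of window TTRs (1, 1, 1, 0.75, 0.75) = 0.9
```

Because every window has the same size, the MATTR value does not shrink simply because the text gets longer, which is what happens to plain TTR.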

# MTLD is not implemented by textstat_lexdiv(), so only TTR and MATTR are requested
ld <- textstat_lexdiv(toks15,
                      measure = c("TTR", "MATTR"))

# Attach metadata and drop the 'document' column for display
ld$President <- docvars(corp15, "President")
ld$Year      <- docvars(corp15, "Year")

# Select only columns that actually exist
show_cols <- intersect(c("Year", "President", "TTR", "MATTR", "MTLD"), names(ld))

knitr::kable(ld[, show_cols],
             digits  = 3,
             caption = "Lexical Diversity — Speeches Since 1965")
Lexical Diversity — Speeches Since 1965
Year President TTR MATTR
1965 Johnson 0.573 0.828
1969 Nixon 0.529 0.805
1973 Nixon 0.425 0.701
1977 Carter 0.647 0.847
1981 Reagan 0.567 0.839
1985 Reagan 0.512 0.828
1989 Bush 0.516 0.820
1993 Clinton 0.584 0.832
1997 Clinton 0.510 0.812
2001 Bush 0.576 0.855
2005 Bush 0.529 0.834
2009 Obama 0.606 0.894
2013 Obama 0.596 0.880
2017 Trump 0.583 0.807
2021 Biden 0.501 0.827
2025 Trump 0.520 0.857
# Pivot only the measure columns that exist
measure_cols <- intersect(c("TTR", "MATTR", "MTLD"), names(ld))

ld_long <- tidyr::pivot_longer(ld,
                                cols      = all_of(measure_cols),
                                names_to  = "Measure",
                                values_to = "Score")

ggplot(ld_long, aes(x = factor(Year), y = Score,
                     group = Measure, colour = Measure)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2.5) +
  facet_wrap(~Measure, scales = "free_y", ncol = 1) +
  labs(title = "Lexical Diversity Over Time",
       x = "Year", y = "Score") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))


4 textstat_readability(): Readability Scores

Quantifies how easy or difficult each speech is to read using classic formulas such as Flesch, Flesch-Kincaid, Gunning Fog, etc.
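For orientation, the two Flesch formulas in their standard published form are sketched below in base R. The counts passed in are hypothetical, and the helper names `flesch()` and `flesch_kincaid()` are illustrative; textstat_readability() does its own sentence, word, and syllable counting, so its scores depend on those counts.

```r
# Flesch Reading Ease: higher = easier
flesch <- function(words, sentences, syllables) {
  206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
}

# Flesch-Kincaid Grade Level: approximate US school grade, higher = harder
flesch_kincaid <- function(words, sentences, syllables) {
  0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
}

# Hypothetical text: 100 words, 5 sentences, 150 syllables
fre <- flesch(100, 5, 150)          # 59.635
fk  <- flesch_kincaid(100, 5, 150)  # 9.91
c(Flesch = fre, Flesch.Kincaid = fk)
```

Both formulas use the same two ingredients (average sentence length and average syllables per word), which is why the Flesch and Flesch.Kincaid columns in the table move in opposite directions.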

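# Readability needs sentence boundaries and function words, so use the raw
# corpus (same subset as corp15; nothing is removed despite the "_nostop" name)
corp15_nostop <- corpus_subset(data_corpus_inaugural, Year >= 1965)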

rd <- textstat_readability(corp15_nostop,
                           measure = c("Flesch",
                                       "Flesch.Kincaid",
                                       "FOG",
                                       "SMOG"))
rd$President <- docvars(corp15_nostop, "President")
rd$Year      <- docvars(corp15_nostop, "Year")

knitr::kable(rd[, c("Year","President","Flesch","Flesch.Kincaid","FOG","SMOG")],
             digits  = 2,
             caption = "Readability Scores — Speeches Since 1965")
Readability Scores — Speeches Since 1965
Year President Flesch Flesch.Kincaid FOG SMOG
1965 Johnson 69.41 7.56 10.41 10.36
1969 Nixon 65.58 9.24 12.05 11.13
1973 Nixon 54.19 12.30 15.20 13.10
1977 Carter 53.38 11.67 14.55 13.06
1981 Reagan 58.75 9.76 12.92 12.22
1985 Reagan 57.58 10.42 13.48 12.47
1989 Bush 73.10 7.15 9.98 9.88
1993 Clinton 55.81 10.38 13.20 12.37
1997 Clinton 59.22 9.83 12.69 11.96
2001 Bush 60.12 8.93 11.63 11.37
2005 Bush 53.19 11.04 14.11 13.02
2009 Obama 60.53 10.23 12.71 11.55
2013 Obama 53.56 11.73 14.51 12.95
2017 Trump 58.58 9.17 12.16 11.78
2021 Biden 73.20 5.78 8.74 9.37
2025 Trump 55.08 9.67 12.64 12.15
ggplot(rd, aes(x = Year, y = Flesch, label = President)) +
  geom_smooth(method = "loess", se = TRUE,
              colour = "#f46d43", fill = "#fee090", linewidth = 1) +
  geom_point(colour = "#4393c3", size = 3) +
  ggrepel::geom_text_repel(size = 3, max.overlaps = 6) +
  labs(title    = "Flesch Reading Ease Over Time",
       subtitle = "Higher = easier to read",
       x = "Year", y = "Flesch Score") +
  theme_minimal(base_size = 13)

5 textstat_dist(): Document Distance

Computes pairwise distances between documents (or features) in the DFM. Supported methods are Euclidean, Manhattan, maximum, Canberra, and Minkowski; cosine is a similarity measure and belongs to textstat_simil().
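One thing worth keeping in mind is that Euclidean distance on raw counts is sensitive to document length. The base-R sketch below, on a hypothetical three-document count matrix, shows two documents with identical word proportions that still end up far apart until the counts are converted to proportions.

```r
# Hypothetical counts: d3 is exactly d1 repeated twice
m <- rbind(d1 = c(2, 0, 1),
           d2 = c(0, 1, 1),
           d3 = c(4, 0, 2))

# Euclidean distance on raw counts
raw_dist <- as.matrix(dist(m, method = "euclidean"))

# Same distance after turning each row into proportions
prop      <- m / rowSums(m)
prop_dist <- as.matrix(dist(prop, method = "euclidean"))

raw_dist["d1", "d3"]   # sqrt(5): apart only because d3 is longer
prop_dist["d1", "d3"]  # 0: identical word proportions
```

In quanteda the analogous step would be to weight the DFM to proportions, for example with dfm_weight(dfm15, scheme = "prop"), before calling textstat_dist(); whether that suits a given analysis is a judgment call.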

dist_mat <- textstat_dist(dfm15, method = "euclidean")

# Build unique labels: "Year President" to avoid duplicate name clashes
labels15 <- paste(docvars(corp15, "Year"), docvars(corp15, "President"))

dist_df <- as.matrix(dist_mat)
rownames(dist_df) <- labels15
colnames(dist_df) <- labels15

# Heatmap via ggplot
dist_long <- dist_df |>
  as.data.frame() |>
  tibble::rownames_to_column("Doc1") |>
  tidyr::pivot_longer(-Doc1, names_to = "Doc2", values_to = "Distance")

ggplot(dist_long, aes(x = Doc1, y = Doc2, fill = Distance)) +
  geom_tile() +
  scale_fill_distiller(palette = "RdYlBu", direction = -1) +
  labs(title = "Euclidean Distance Between Speeches",
       x = NULL, y = NULL) +
  theme_minimal(base_size = 10) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))


6 textstat_simil(): Document Similarity

The counterpart to textstat_dist(): higher values mean more similar documents. Cosine similarity is a common choice for text because it compares the direction of two count vectors and ignores their length.
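A minimal base-R sketch of cosine similarity on hypothetical count vectors may help here; the helper `cosine()` is illustrative and is not the package's implementation.

```r
# Cosine similarity: dot product divided by the product of vector norms
cosine <- function(a, b) {
  sum(a * b) / sqrt(sum(a^2) * sum(b^2))
}

# Hypothetical feature counts for three documents
d1 <- c(2, 0, 1)
d2 <- c(0, 1, 1)
d3 <- c(4, 0, 2)   # d1 repeated twice

sim_13 <- cosine(d1, d3)  # 1: same proportions, different length
sim_12 <- cosine(d1, d2)  # 1 / sqrt(10), about 0.316
c(d1_d3 = sim_13, d1_d2 = sim_12)
```

For non-negative counts the value lies between 0 (no shared features) and 1 (identical proportions).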

sim_mat <- textstat_simil(dfm15, method = "cosine")

# Same unique labels as above
sim_df  <- as.matrix(sim_mat)
rownames(sim_df) <- labels15
colnames(sim_df) <- labels15

sim_long <- sim_df |>
  as.data.frame() |>
  tibble::rownames_to_column("Doc1") |>
  tidyr::pivot_longer(-Doc1, names_to = "Doc2", values_to = "Similarity")

ggplot(sim_long, aes(x = Doc1, y = Doc2, fill = Similarity)) +
  geom_tile() +
  scale_fill_distiller(palette = "YlGn", direction = 1) +
  labs(title = "Cosine Similarity Between Speeches",
       x = NULL, y = NULL) +
  theme_minimal(base_size = 10) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))


7 textstat_keyness(): Keyness Analysis

Identifies terms that are significantly more (or less) frequent in a target set than in a reference set. The default measure is chi-squared; a likelihood-ratio test, Fisher's exact test, and pointwise mutual information are also available.
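The chi-squared keyness of a single term comes from a 2x2 table: the term versus all other tokens, in the target versus the reference. The base-R sketch below uses hypothetical counts and no continuity correction; textstat_keyness() may apply a correction, so its values can differ slightly. Giving the statistic a sign according to whether the term is over- or under-represented in the target is how the positive and negative ends of a keyness ranking arise.

```r
# Hypothetical counts: the term occurs 30 times in 1000 target tokens
# and 10 times in 1000 reference tokens
tab <- matrix(c(30, 10, 970, 990), nrow = 2,
              dimnames = list(c("target", "reference"),
                              c("term", "other")))

res  <- chisq.test(tab, correct = FALSE)   # plain Pearson chi-squared
chi2 <- unname(res$statistic)

# Sign: positive if the term is more frequent in the target than expected
expected_target <- res$expected["target", "term"]
signed_chi2 <- if (tab["target", "term"] > expected_target) chi2 else -chi2

signed_chi2   # about 10.204
res$p.value
```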

# Target: 21st-century speeches; Reference: all earlier speeches
# (1789-1997, labelled "20th" by the grouping variable)
dfm_cent <- dfm_group(dfm_all, groups = century)

key <- textstat_keyness(dfm_cent, target = "21st")

head(key, 15) |>
  knitr::kable(digits  = 3,
               caption = "Keyness: 21st-Century vs 20th-Century Speeches")
Keyness: 21st-Century vs 20th-Century Speeches
feature chi2 p n_target n_reference
america 245.853 0 107 153
thank 163.363 0 34 17
american 120.372 0 87 190
stori 108.238 0 19 5
job 73.102 0 15 6
today 65.895 0 41 80
day 45.848 0 37 87
border 45.278 0 11 6
soul 38.496 0 11 8
generat 37.875 0 25 51
back 37.771 0 18 26
storm 37.154 0 8 3
ideal 36.481 0 19 32
worker 34.837 0 7 2
freedom 33.312 0 46 147
# quanteda.textplots provides a dedicated keyness plot
if (requireNamespace("quanteda.textplots", quietly = TRUE)) {
  quanteda.textplots::textplot_keyness(key,
                                       n      = 15,
                                       labelcolor = "grey30") +
    labs(title = "Keyness Plot — 21st vs 20th Century") +
    theme_minimal(base_size = 12)
} else {
  # Fallback ggplot version
  key_top <- rbind(head(key, 10), tail(key, 10))
  key_top$direction <- ifelse(key_top$chi2 > 0, "21st century", "20th century")

  ggplot(key_top, aes(x = reorder(feature, chi2), y = chi2, fill = direction)) +
    geom_col() +
    coord_flip() +
    scale_fill_manual(values = c("21st century" = "#1a9641",
                                 "20th century" = "#d7191c")) +
    labs(title = "Keyness: Top Distinctive Terms",
         x = NULL, y = "Chi-squared statistic", fill = NULL) +
    theme_minimal(base_size = 12)
}


8 textstat_collocations(): Collocations

Finds multi-word expressions that appear together more often than chance. Useful for discovering idioms, named entities, and technical phrases.
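For two-word collocations, the lambda and z columns can be read as a smoothed log odds ratio and its Wald z-score. The base-R sketch below assumes that form (with 0.5 added to each cell, as the package documentation describes for its default smoothing) and uses hypothetical counts; it is meant to convey the idea, not to reproduce the package's numbers.

```r
# Hypothetical 2x2 counts for the candidate bigram "united states"
n11 <- 8    # "united" followed by "states"
n10 <- 2    # "united" followed by some other word
n01 <- 2    # some other word followed by "states"
n00 <- 88   # neither word in its position

smoothing <- 0.5                       # assumed default smoothing constant
n <- c(n11, n10, n01, n00) + smoothing

# Log odds ratio of the smoothed table
lambda <- log(n[1]) - log(n[2]) - log(n[3]) + log(n[4])

# Wald z: lambda divided by its approximate standard error
z <- lambda / sqrt(sum(1 / n))

c(lambda = lambda, z = z)
```

A large lambda means the two words co-occur far more often than their individual frequencies would suggest; z additionally rewards larger counts, which is why very frequent pairs such as "of the" can rank highly by z despite a modest lambda.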

# Run on the tokens object (before stopword removal for natural phrases)
toks_raw <- tokens(corp, remove_punct = TRUE)

col <- textstat_collocations(toks_raw,
                             size    = 2,        # bigrams
                             min_count = 5)      # at least 5 occurrences

head(col, 20) |>
  knitr::kable(digits  = 3,
               caption = "Top Bigram Collocations (λ statistic)")
Top Bigram Collocations (λ statistic)
collocation count count_nested length lambda z
of the 1786 0 2 1.563 53.069
it is 327 0 2 3.541 51.057
has been 188 0 2 5.200 50.397
have been 209 0 2 4.758 49.254
those who 130 0 2 5.817 45.564
we have 270 0 2 3.371 45.065
united states 165 0 2 7.935 43.578
of our 635 0 2 2.029 41.981
will be 220 0 2 3.337 41.329
in the 828 0 2 1.709 40.143
let us 101 0 2 6.342 37.517
should be 140 0 2 4.301 37.515
we are 187 0 2 3.127 36.252
we will 202 0 2 2.971 36.225
may be 126 0 2 4.043 35.165
fellow citizens 79 0 2 7.822 34.764
i shall 96 0 2 4.323 34.030
we must 128 0 2 3.688 33.173
must be 117 0 2 3.782 33.094
there is 104 0 2 4.150 32.331
col3 <- textstat_collocations(toks_raw, size = 3, min_count = 3)
head(col3, 10) |>
  knitr::kable(digits  = 3,
               caption = "Top Trigram Collocations")
Top Trigram Collocations
collocation count count_nested length lambda z
of which the 11 0 3 3.065 7.970
all of us 15 0 3 4.455 7.394
in which the 14 0 3 2.486 7.161
than that of 8 0 3 4.966 6.965
is not the 15 0 3 2.523 6.733
is that of 5 0 3 3.498 6.448
as that of 4 0 3 4.885 6.393
the american people 40 0 3 5.651 6.307
of president of 6 0 3 4.783 6.269
to that of 4 0 3 3.389 6.222
# Feature co-occurrence network (related to, but distinct from, collocations)
# 1. Preparation (a separate object, so the stemmed `toks` above is not overwritten)
toks_net <- tokens(data_corpus_inaugural, remove_punct = TRUE) |>
  tokens_tolower() |>
  tokens_remove(stopwords("english"))

# 2. Build a feature co-occurrence matrix (FCM) with a 5-word window
fcmat <- fcm(toks_net, context = "window", window = 5)

# 3. Keep only the 50 most frequent words for readability:
# build a DFM and collect its top features
dfmat <- dfm(toks_net)
feat <- names(topfeatures(dfmat, 50))

# Filter the FCM down to the words selected from the DFM
fcm_subset <- fcm_select(fcmat, pattern = feat)

# 4. Draw the network plot
textplot_network(fcm_subset,
                 min_freq = 0.9,
                 vertex_labelsize = 5,
                 vertex_color = "#E41A1C",    # reddish nodes
                 edge_color = "#377EB8",      # bluish edges
                 edge_alpha = 0.4,            # faint edges for readability
                 vertex_size = colSums(fcm_subset) / max(colSums(fcm_subset)) * 5) +
  labs(title = "Word Co-occurrence Network in the Inaugural Addresses",
       subtitle = "Based on the 50 most frequent words (stopwords removed)",
       caption = "Source: data_corpus_inaugural") +
  # hjust is a theme setting, not a labs() argument
  theme(plot.title = element_text(hjust = 0.5))


9 textstat_entropy(): Shannon Entropy

Shannon entropy measures the diversity of term usage across documents: high entropy → terms spread evenly; low entropy → concentrated in few docs.


Shannon entropy measures the uncertainty about the outcome of a random event. Its formula is:

\[H(X) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i)\]

where H(X) is the entropy and P(x_i) is the probability of outcome x_i. textstat_entropy() uses base-2 logarithms by default, so its values are in bits.
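The formula can be checked by hand in base R. In the sketch below the counts are hypothetical and play the role of one feature's occurrences across four documents; the helper `shannon()` is illustrative, with zero counts dropped because they contribute nothing to the sum.

```r
# Shannon entropy of a vector of counts (base 2 by default)
shannon <- function(counts, base = 2) {
  p <- counts[counts > 0] / sum(counts)   # probabilities of non-zero outcomes
  -sum(p * log(p, base = base))
}

even_use   <- shannon(c(2, 2, 2, 2))  # spread evenly over 4 documents: 2 bits
single_doc <- shannon(c(8, 0, 0, 0))  # concentrated in one document: 0 bits
skewed     <- shannon(c(5, 1, 1, 1))  # somewhere in between

c(even = even_use, single = single_doc, skewed = skewed)
```

The maximum for four documents is log2(4) = 2 bits, reached only when the term is spread perfectly evenly.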

ent <- textstat_entropy(dfm_all, margin = "features")

ent_top <- ent |>
  arrange(desc(entropy)) |>
  head(20)

ggplot(ent_top, aes(x = reorder(feature, entropy), y = entropy)) +
  geom_col(fill = "#7b3294", alpha = .8) +
  coord_flip() +
  labs(title    = "Top 20 Features by Entropy",
       subtitle = "High entropy = term used across many documents evenly",
       x = NULL, y = "Shannon Entropy") +
  theme_minimal(base_size = 12)

# Entropy across features within each document
ent_doc <- textstat_entropy(dfm15, margin = "documents")
# Use "Year President" labels: several presidents gave two speeches, and a
# bare surname would collapse them onto one row
ent_doc$Speech <- paste(docvars(corp15, "Year"), docvars(corp15, "President"))

ggplot(ent_doc, aes(x = reorder(Speech, entropy), y = entropy)) +
  geom_point(colour = "#e66101", size = 4) +
  geom_segment(aes(xend = reorder(Speech, entropy), yend = 0),
               colour = "#e66101", linewidth = .8) +
  coord_flip() +
  labs(title = "Document-Level Entropy",
       x = NULL, y = "Entropy") +
  theme_minimal(base_size = 12)


10 textstat_summary(): Corpus, Tokens, and DFM Summary

A quick diagnostic function returning token counts, type counts, sentences, and other metadata for each document.

summ_corp <- textstat_summary(corp15)
knitr::kable(summ_corp,
             caption = "Corpus Summary — Speeches Since 1965",
             digits  = 1)
Corpus Summary — Speeches Since 1965
document chars sents tokens types puncts numbers symbols urls tags emojis
1965-Johnson 8205 93 1710 535 221 3 0 0 0 0
1969-Nixon 11644 103 2416 714 292 0 0 0 0 0
1973-Nixon 10007 68 1995 515 193 1 0 0 0 0
1977-Carter 6878 52 1370 501 146 3 0 0 0 0
1981-Reagan 13743 129 2781 850 349 1 0 0 0 0
1985-Reagan 14572 123 2909 876 345 11 0 0 0 0
1989-Bush 12529 141 2674 756 357 2 0 0 0 0
1993-Clinton 9113 81 1833 605 235 0 0 0 0 0
1997-Clinton 12262 111 2436 726 279 0 0 0 0 0
2001-Bush 9054 97 1806 592 222 1 0 0 0 0
2005-Bush 11923 99 2312 734 241 0 0 0 0 0
2009-Obama 13460 110 2689 900 299 0 0 0 0 0
2013-Obama 11917 88 2317 786 220 5 0 0 0 0
2017-Trump 8433 88 1660 547 215 2 0 0 0 0
2021-Biden 13133 216 2766 744 394 6 0 0 0 0
2025-Trump 17077 177 3347 950 434 4 0 0 0 0
summ_tok <- textstat_summary(toks15)
knitr::kable(summ_tok,
             caption = "Tokens Summary (after preprocessing)",
             digits  = 1)
Tokens Summary (after preprocessing)
document chars sents tokens types puncts numbers symbols urls tags emojis
1965-Johnson NA NA 691 396 0 0 0 0 0 0
1969-Nixon NA NA 1028 544 0 0 0 0 0 0
1973-Nixon NA NA 851 362 0 0 0 0 0 0
1977-Carter NA NA 592 383 0 0 0 0 0 0
1981-Reagan NA NA 1146 650 0 0 0 0 0 0
1985-Reagan NA NA 1291 661 0 0 0 0 0 0
1989-Bush NA NA 1092 564 0 0 0 0 0 0
1993-Clinton NA NA 798 466 0 0 0 0 0 0
1997-Clinton NA NA 1130 576 0 0 0 0 0 0
2001-Bush NA NA 783 451 0 0 0 0 0 0
2005-Bush NA NA 1041 551 0 0 0 0 0 0
2009-Obama NA NA 1173 711 0 0 0 0 0 0
2013-Obama NA NA 1031 614 0 0 0 0 0 0
2017-Trump NA NA 713 416 0 0 0 0 0 0
2021-Biden NA NA 1127 565 0 0 0 0 0 0
2025-Trump NA NA 1448 753 0 0 0 0 0 0

11 Quick-Reference Cheat Sheet

Function | Input | What it computes
textstat_frequency() | DFM | Term frequencies & document frequencies
textstat_lexdiv() | Tokens | Lexical diversity (TTR, MATTR, MSTTR, …)
textstat_readability() | Corpus | Readability indices (Flesch, FOG, SMOG, …)
textstat_dist() | DFM | Pairwise document/feature distances
textstat_simil() | DFM | Pairwise document/feature similarities
textstat_keyness() | DFM | Keyness of terms in target vs reference
textstat_collocations() | Tokens | Multi-word collocations (λ, z-scores)
textstat_entropy() | DFM | Shannon entropy per doc or feature
textstat_summary() | Corpus/Tokens/DFM | Token, type & sentence counts per doc