quanteda.textstats is a companion package to quanteda that provides statistical analysis functions for text objects (corpora, tokens, and document- feature matrices). This document walks through each major function with reproducible examples using the built-in data_corpus_inaugural dataset. At the end of the document, I also introduce a few other R packages related to lexicology.
A quanteda.textstats a quanteda kiegészítő csomagja, amely statisztikai elemzési függvényeket biztosít szöveges objektumokhoz (korpuszokhoz, tokenekhez és dokumentum-jellemző mátrixokhoz). Ez a dokumentum a beépített data_corpus_inaugural adatbázist felhasználva, reprodukálható példákkal mutatja be az egyes főbb függvényeket. A dokumentum végén még néhány egyéb, lexikológia tárgykörhöz készült R csomagot is bemutatok.


1 Setup & Data Preparation / Beállítás és adatok előkészítése

# Install packages if needed:
# install.packages(c("quanteda", "quanteda.textstats",
#                    "quanteda.textplots", "ggplot2", "dplyr"))

library(quanteda)
library(quanteda.textstats)
library(quanteda.textplots)
library(ggplot2)
library(dplyr)
library(paletteer)
library(plotly)
library(kableExtra)
library(textstem)
tokens <- quanteda::tokens   # pin tokens() to quanteda for this session

# ── Build a DFM from US inaugural speeches ─────────
corp <- data_corpus_inaugural 
toks <- corp |>
  quanteda::tokens(remove_punct = TRUE,
      remove_symbols = TRUE,
      remove_numbers = TRUE) |>
  quanteda::tokens_remove(pattern = stopwords("en"))

dfm_all <- dfm(toks)

# Subset: last 15 speeches for compact comparisons
corp15   <- corpus_subset(corp, Year >= 1965)

toks15 <- corp15 |>
  quanteda::tokens(remove_punct  = TRUE,
      remove_symbols = TRUE,
      remove_numbers = TRUE) |>
  quanteda::tokens_remove(pattern = stopwords("en"))

dfm15    <- dfm(toks15)

cat("Full DFM:", ndoc(dfm_all), "docs ×", nfeat(dfm_all), "features\n")
## Full DFM: 60 docs × 9359 features
cat("Subset DFM:", ndoc(dfm15),  "docs ×", nfeat(dfm15),  "features\n")
## Subset DFM: 16 docs × 3993 features

2 textstat_frequency() — Term Frequency / Kifejezés gyakorisága

textstat_frequency() returns term frequencies across the whole corpus or within groups. Ez a függvény szógyakoriságokat ad vissza az egész korpuszban, vagy annak csoportjaiban. The docfreq column shows the number of documents in which the word appears at least once. A docfreq oszlop azon dokumentumok számát mutatja, amelyekben a szó legalább egyszer előfordul.

# Top 20 terms overall
freq <- textstat_frequency(dfm_all, n = 20)

knitr::kable(freq[, c("feature","frequency","rank","docfreq")],
             caption = "Top 20 Stems — All Inaugural Speeches",
             align   = "lrrrr")
Top 20 Stems — All Inaugural Speeches
feature frequency rank docfreq
people 592 1 58
government 575 2 53
us 507 3 57
can 489 4 57
must 377 5 53
upon 371 6 47
great 354 7 57
may 343 8 54
states 343 8 48
world 329 10 54
nation 324 11 55
country 323 12 55
shall 316 13 51
every 309 14 53
one 274 15 50
peace 259 16 48
new 256 17 51
power 246 18 49
now 238 19 54
public 227 20 43
freq20 <- textstat_frequency(dfm_all, n = 20)

ggplot(freq20, aes(x = reorder(feature, frequency), y = frequency)) +
  geom_col(fill = "#2c7bb6", alpha = .85) +
  coord_flip() +
  labs(title    = "Top 20 Terms — US Inaugural Addresses",
       subtitle = "After stopword removal and stemming",
       x = NULL, y = "Total Frequency") +
  theme_minimal(base_size = 13)

# Frequency within groups (20th vs 21st century)
docvars(dfm_all, "century") <- ifelse(docvars(dfm_all, "Year") >= 2000,
                                       "21st", "20th")
freq_grp <- textstat_frequency(dfm_all, n = 10, groups = century)

ggplot(freq_grp, aes(x = reorder(feature, frequency), y = frequency,
                     fill = group)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~group, scales = "free_y") +
  coord_flip() +
  scale_fill_manual(values = c("20th" = "#d7191c", "21st" = "#1a9641")) +
  labs(title = "Top Terms by Century", x = NULL, y = "Frequency") +
  theme_minimal(base_size = 12)


2.1 Textstat-freqency - The President … / Az elnök

2.2 … and his favourite words / … és kedvenc szavai

3 textstat_lexdiv() — Lexical Diversity / Szókincs-változatosság

Measures how rich and varied the vocabulary is for each document. Several indices are available: TTR, MATTR, MTLD, MSTTR, etc.

TTR (Type-Token Ratio - Szófaj-Szó Arány)

Azt mutatja meg, hogy az adott szövegben mekkora az egyedi szavak aránya az összes szóhoz képest. Hátránya: nagyon érzékeny a szöveghosszra, így nem használható közvetlenül különböző hosszúságú szövegek összehasonlítására.  

MATTR (Moving-Average Type-Token Ratio - Mozgóátlagos TTR)

Ez egy modernebb, a szöveghosszra kevésbé érzékeny mérőszám. Előnye: alkalmas a különböző hosszúságú szövegek szókincsének összehasonlítására anélkül, hogy a szöveg hossza törzítaná az eredményt.

ld <- textstat_lexdiv(toks15,
                      measure = c("TTR", "MATTR", "MTLD"))

# Attach metadata and drop the 'document' column for display
ld$President <- docvars(corp15, "President")
ld$Year      <- docvars(corp15, "Year")

# Select only columns that actually exist
show_cols <- intersect(c("Year", "President", "TTR", "MATTR", "MTLD"), names(ld))

knitr::kable(ld[, show_cols],
             digits  = 3,
             caption = "Lexical Diversity — Speeches Since 1965")
Lexical Diversity — Speeches Since 1965
Year President TTR MATTR
1965 Johnson 0.637 0.853
1969 Nixon 0.604 0.834
1973 Nixon 0.495 0.743
1977 Carter 0.694 0.867
1981 Reagan 0.646 0.867
1985 Reagan 0.587 0.857
1989 Bush 0.589 0.846
1993 Clinton 0.648 0.856
1997 Clinton 0.560 0.836
2001 Bush 0.642 0.886
2005 Bush 0.618 0.862
2009 Obama 0.683 0.912
2013 Obama 0.661 0.902
2017 Trump 0.641 0.834
2021 Biden 0.557 0.843
2025 Trump 0.581 0.874
# Pivot only the measure columns that exist
measure_cols <- intersect(c("TTR", "MATTR", "MTLD"), names(ld))

ld_long <- tidyr::pivot_longer(ld,
                                cols      = all_of(measure_cols),
                                names_to  = "Measure",
                                values_to = "Score")

ggplot(ld_long, aes(x = factor(Year), y = Score,
                     group = Measure, colour = Measure)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2.5) +
  facet_wrap(~Measure, scales = "free_y", ncol = 1) +
  labs(title = "Lexical Diversity Over Time",
       x = "Year", y = "Score") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))


4 textstat_readability() — Readability Scores / Olvashatósági mutatók

Quantifies how easy or difficult each speech is to read using classic formulas such as Flesch, Flesch-Kincaid, Gunning Fog, etc.

# Works on a corpus (needs sentence structure)
corp15_nostop <- corpus_subset(data_corpus_inaugural, Year >= 1965)

rd <- textstat_readability(corp15_nostop,
                           measure = c("Flesch",
                                       "Flesch.Kincaid",
                                       "FOG",
                                       "SMOG"))
rd$President <- docvars(corp15_nostop, "President")
rd$Year      <- docvars(corp15_nostop, "Year")

knitr::kable(rd[, c("Year","President","Flesch","Flesch.Kincaid","FOG","SMOG")],
             digits  = 2,
             caption = "Readability Scores — Speeches Since 1965")
Readability Scores — Speeches Since 1965
Year President Flesch Flesch.Kincaid FOG SMOG
1965 Johnson 69.41 7.56 10.41 10.36
1969 Nixon 65.58 9.24 12.05 11.13
1973 Nixon 54.19 12.30 15.20 13.10
1977 Carter 53.38 11.67 14.55 13.06
1981 Reagan 58.75 9.76 12.92 12.22
1985 Reagan 57.58 10.42 13.48 12.47
1989 Bush 73.10 7.15 9.98 9.88
1993 Clinton 55.81 10.38 13.20 12.37
1997 Clinton 59.22 9.83 12.69 11.96
2001 Bush 60.12 8.93 11.63 11.37
2005 Bush 53.19 11.04 14.11 13.02
2009 Obama 60.53 10.23 12.71 11.55
2013 Obama 53.56 11.73 14.51 12.95
2017 Trump 58.58 9.17 12.16 11.78
2021 Biden 73.20 5.78 8.74 9.37
2025 Trump 55.08 9.67 12.64 12.15
ggplot(rd, aes(x = Year, y = Flesch, label = President)) +
  geom_smooth(method = "loess", se = TRUE,
              colour = "#f46d43", fill = "#fee090", linewidth = 1) +
  geom_point(colour = "#4393c3", size = 3) +
  ggrepel::geom_text_repel(size = 3, max.overlaps = 6) +
  labs(title    = "Flesch Reading Ease Over Time",
       subtitle = "Higher = easier to read",
       x = "Year", y = "Flesch Score") +
  theme_minimal(base_size = 13)

5 textstat_dist() — Document Distance / Dokumentum távolság

Computes pairwise distances between documents (or features) in the DFM. Supports Euclidean, Manhattan, cosine, etc.

dist_mat <- textstat_dist(dfm15, method = "euclidean")

# Build unique labels: "Year President" to avoid duplicate name clashes
labels15 <- paste(docvars(corp15, "Year"), docvars(corp15, "President"))

dist_df <- as.matrix(dist_mat)
rownames(dist_df) <- labels15
colnames(dist_df) <- labels15

# Heatmap via ggplot
dist_long <- dist_df |>
  as.data.frame() |>
  tibble::rownames_to_column("Doc1") |>
  tidyr::pivot_longer(-Doc1, names_to = "Doc2", values_to = "Distance")

ggplot(dist_long, aes(x = Doc1, y = Doc2, fill = Distance)) +
  geom_tile() +
  scale_fill_distiller(palette = "RdYlBu", direction = -1) +
  labs(title = "Euclidean Distance Between Speeches",
       x = NULL, y = NULL) +
  theme_minimal(base_size = 10) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))


6 textstat_simil() — Document Similarity / Dokumentum hasonlóság

The counterpart to textstat_dist(): higher values mean more similar documents. Cosine similarity is the most popular choice for text.

sim_mat <- textstat_simil(dfm15, method = "cosine")

# Same unique labels as above
sim_df  <- as.matrix(sim_mat)
rownames(sim_df) <- labels15
colnames(sim_df) <- labels15

sim_long <- sim_df |>
  as.data.frame() |>
  tibble::rownames_to_column("Doc1") |>
  tidyr::pivot_longer(-Doc1, names_to = "Doc2", values_to = "Similarity")

ggplot(sim_long, aes(x = Doc1, y = Doc2, fill = Similarity)) +
  geom_tile() +
  scale_fill_distiller(palette = "YlGn", direction = 1) +
  labs(title = "Cosine Similarity Between Speeches",
       x = NULL, y = NULL) +
  theme_minimal(base_size = 10) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))


7 textstat_keyness() — Keyness Analysis / Kulcsszó-elemzés

Identifies terms that are significantly more (or less) frequent in a target set compared to a reference set. Uses chi-squared or log-likelihood tests.

# Target: 21st-century speeches; Reference: 20th-century speeches
dfm_cent <- dfm_group(dfm_all, groups = century)

key <- textstat_keyness(dfm_cent, target = "21st")

head(key, 15) |>
  knitr::kable(digits  = 3,
               caption = "Keyness: 21st-Century vs 20th-Century Speeches")
Keyness: 21st-Century vs 20th-Century Speeches
feature chi2 p n_target n_reference
america 192.347 0 88 132
thank 187.142 0 34 11
americans 104.360 0 40 50
story 100.626 0 18 5
jobs 73.151 0 12 2
today 53.647 0 37 78
america’s 50.472 0 18 19
borders 44.372 0 9 3
nation 42.932 0 72 252
day 42.776 0 30 64
workers 40.756 0 7 1
raging 40.288 0 6 0
american 40.158 0 47 138
back 39.316 0 18 25
president 38.315 0 31 73
# quanteda.textplots provides a dedicated keyness plot
if (requireNamespace("quanteda.textplots", quietly = TRUE)) {
  quanteda.textplots::textplot_keyness(key,
                                       n      = 15,
                                       labelcolor = "grey30") +
    labs(title = "Keyness Plot — 21st vs 20th Century") +
    theme_minimal(base_size = 12)
} else {
  # Fallback ggplot version
  key_top <- rbind(head(key, 10), tail(key, 10))
  key_top$direction <- ifelse(key_top$chi2 > 0, "21st century", "20th century")

  ggplot(key_top, aes(x = reorder(feature, chi2), y = chi2, fill = direction)) +
    geom_col() +
    coord_flip() +
    scale_fill_manual(values = c("21st century" = "#1a9641",
                                 "20th century" = "#d7191c")) +
    labs(title = "Keyness: Top Distinctive Terms",
         x = NULL, y = "Chi-squared statistic", fill = NULL) +
    theme_minimal(base_size = 12)
}


8 textstat_collocations() — Collocations / Szavak egybeesései

Finds multi-word expressions that appear together more often than chance. Useful for discovering idioms, named entities, and technical phrases.

# Run on the tokens object (before stopword removal for natural phrases)
toks_raw <- tokens(corp, remove_punct = TRUE)

col <- textstat_collocations(toks_raw,
                             size    = 2,        # bigrams
                             min_count = 5)      # at least 5 occurrences

head(col, 20) |>
  knitr::kable(digits  = 3,
               caption = "Top Bigram Collocations (λ statistic)")
Top Bigram Collocations (λ statistic)
collocation count count_nested length lambda z
of the 1786 0 2 1.563 53.069
it is 327 0 2 3.541 51.057
has been 188 0 2 5.200 50.397
have been 209 0 2 4.758 49.254
those who 130 0 2 5.817 45.564
we have 270 0 2 3.371 45.065
united states 165 0 2 7.935 43.578
of our 635 0 2 2.029 41.981
will be 220 0 2 3.337 41.329
in the 828 0 2 1.709 40.143
let us 101 0 2 6.342 37.517
should be 140 0 2 4.301 37.515
we are 187 0 2 3.127 36.252
we will 202 0 2 2.971 36.225
may be 126 0 2 4.043 35.165
fellow citizens 79 0 2 7.822 34.764
i shall 96 0 2 4.323 34.030
we must 128 0 2 3.688 33.173
must be 117 0 2 3.782 33.094
there is 104 0 2 4.150 32.331
col3 <- textstat_collocations(toks_raw, size = 3, min_count = 3)
head(col3, 10) |>
  knitr::kable(digits  = 3,
               caption = "Top Trigram Collocations")
Top Trigram Collocations
collocation count count_nested length lambda z
of which the 11 0 3 3.065 7.970
all of us 15 0 3 4.455 7.394
in which the 14 0 3 2.486 7.161
than that of 8 0 3 4.966 6.965
is not the 15 0 3 2.523 6.733
is that of 5 0 3 3.498 6.448
as that of 4 0 3 4.885 6.393
the american people 40 0 3 5.651 6.307
of president of 6 0 3 4.783 6.269
to that of 4 0 3 3.389 6.222
# 1. Előkészít#és
toks <- tokens(data_corpus_inaugural, remove_punct = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(stopwords("english"))

# 2. Co-occurrence mátrix (FCM) létrehozása 5 szavas ablakkal
fcmat <- fcm(toks, context = "window", window = 5)

# 3. Csak a 50 leggyakoribb szó kiválasztása a jobb olvashatóságért
# DFM létrehozása és a leggyakoribb szavak kigyűjtése
dfmat <- dfm(toks)
feat <- names(topfeatures(dfmat, 50)) # Itt még működik a topfeatures

# Az FCM szűrése a DFM alapján kapott szavakra
fcm_subset <- fcm_select(fcmat, pattern = feat)

# 4. Hálózati diagram kirajzolása
library(ggplot2)
textplot_network(fcm_subset, 
                 min_freq = 0.9, 
                 vertex_labelsize = 5,
                 vertex_color = "#E41A1C",    # Pirosas csomópontok
                 edge_color = "#377EB8",      # Kékes élek
                 edge_alpha = 0.4,            # Halvány élek a jobb olvashatóságért
                 vertex_size = colSums(fcm_subset)/max(colSums(fcm_subset)) * 5) +
                    labs(title = "Szókapcsolatok hálózata az elnöki beszédekben",
                    hjust = 0.5,
                    subtitle = "Az 50 leggyakoribb szó alapján ('stopwords' nélkül)",
                    caption = "Forrás: data_corpus_inaugural")


9 textstat_entropy() — Shannon entropy / Shannon-féle entrópia

Shannon entropy measures the diversity of term usage across documents: high entropy → terms spread evenly; low entropy → concentrated in few docs. A szavak entrópiájának vizsgálata: egy szó, amely csak egy dokumentumban fordul elő, alacsony entrópiával rendelkezik, míg egy olyan szó, amely sok dokumentumban egyenletesen eloszlik, magas entrópiával bír és csak néhány dokumentumban koncentrálódik.


A Shannon-entrópia egy vélhetően bekövetkező esemény bizonytalanságát méri. A képlete a következő:

\[H(X) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i)\]

Ahol: - H(X) az entrópia, - P(x_i) az x_i esemény valószínűsége. A függvény akkor éri el a maximumát, amikor a rendszer a lehető legkevésbé jósolható előre, azaz egyenletes az eloszlás. Ez azt jelenti, hogy minden lehetséges kimenetel pontosan ugyanakkora valószínűséggel következik be. Képletben: p1 = p2 = p3 = … = pn = 1/n A maximum értékéhez a feltételt a képletbe helyettesítve: \[H_{max} = \log_{2}(n) \text{ bit}\] A Shannon-entrópia értéke mindig nem negatív, és a rendszer bizonytalanságának mértékét jelzi. Minél magasabb az entrópia, annál nagyobb a bizonytalanság, és annál több információt tartalmaz a rendszer. A legegyszerűbb példa az érmefeldobás: egy tisztességes érme esetén két kimenetel van (fej vagy írás), mindkettő valószínűsége 0,5. Ebben az esetben a Shannon-entrópia értéke: \[H(X) = - (0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1 \text{ bit}\] Ez azt jelenti, hogy egy tisztességes érme feldobása 1 bit információt tartalmaz, és a kimenetel teljesen bizonytalan. Ha az érme csaló lenne, és mindig fej lenne, akkor a Shannon-entrópia értéke 0 lenne, mivel nincs bizonytalanság: \[H(X) = - (1 \log_2 1 + 0 \log_2 0) = 0 \text{ bit}\]

ent <- textstat_entropy(dfm_all, margin = "features")

ent_top <- ent |>
  arrange(desc(entropy)) |>
  head(20)

ggplot(ent_top, aes(x = reorder(feature, entropy), y = entropy)) +
  geom_col(fill = "#7b3294", alpha = .8) +
  coord_flip() +
  labs(title    = "Top 20 Features by Entropy",
       subtitle = "High entropy = term used across many documents evenly",
       x = NULL, y = "Shannon Entropy") +
  theme_minimal(base_size = 12)

# Entropy across features within each document
ent_doc <- textstat_entropy(dfm15, margin = "documents")
ent_doc$President <- docvars(corp15, "President")

ggplot(ent_doc, aes(x = reorder(President, entropy), y = entropy)) +
  geom_point(colour = "#e66101", size = 4) +
  geom_segment(aes(xend = reorder(President, entropy), yend = 0),
               colour = "#e66101", linewidth = .8) +
  coord_flip() +
  labs(title = "Document-Level Entropy",
       x = NULL, y = "Entropy") +
  theme_minimal(base_size = 12)


10 textstat_summary() — Corpus-Tokens-DFM Summary / Corpus-Tokens-DFM összefoglaló

A quick diagnostic function returning token counts, type counts, sentences, and other metadata for each document.

summ_corp <- textstat_summary(corp15)
knitr::kable(summ_corp,
             caption = "Corpus Summary — Speeches Since 1965",
             digits  = 1)
Corpus Summary — Speeches Since 1965
document chars sents tokens types puncts numbers symbols urls tags emojis
1965-Johnson 8205 93 1710 535 221 3 0 0 0 0
1969-Nixon 11644 103 2416 714 292 0 0 0 0 0
1973-Nixon 10007 68 1995 515 193 1 0 0 0 0
1977-Carter 6878 52 1370 501 146 3 0 0 0 0
1981-Reagan 13743 129 2781 850 349 1 0 0 0 0
1985-Reagan 14572 123 2909 876 345 11 0 0 0 0
1989-Bush 12529 141 2674 756 357 2 0 0 0 0
1993-Clinton 9113 81 1833 605 235 0 0 0 0 0
1997-Clinton 12262 111 2436 726 279 0 0 0 0 0
2001-Bush 9054 97 1806 592 222 1 0 0 0 0
2005-Bush 11923 99 2312 734 241 0 0 0 0 0
2009-Obama 13460 110 2689 900 299 0 0 0 0 0
2013-Obama 11917 88 2317 786 220 5 0 0 0 0
2017-Trump 8433 88 1660 547 215 2 0 0 0 0
2021-Biden 13133 216 2766 744 394 6 0 0 0 0
2025-Trump 17077 177 3347 950 434 4 0 0 0 0
summ_tok <- textstat_summary(toks15)
knitr::kable(summ_tok,
             caption = "Tokens Summary (after preprocessing)",
             digits  = 1)
Tokens Summary (after preprocessing)
document chars sents tokens types puncts numbers symbols urls tags emojis
1965-Johnson NA NA 691 440 0 0 0 0 0 0
1969-Nixon NA NA 1028 621 0 0 0 0 0 0
1973-Nixon NA NA 851 421 0 0 0 0 0 0
1977-Carter NA NA 592 411 0 0 0 0 0 0
1981-Reagan NA NA 1146 740 0 0 0 0 0 0
1985-Reagan NA NA 1291 758 0 0 0 0 0 0
1989-Bush NA NA 1092 643 0 0 0 0 0 0
1993-Clinton NA NA 798 517 0 0 0 0 0 0
1997-Clinton NA NA 1130 633 0 0 0 0 0 0
2001-Bush NA NA 783 503 0 0 0 0 0 0
2005-Bush NA NA 1041 643 0 0 0 0 0 0
2009-Obama NA NA 1173 801 0 0 0 0 0 0
2013-Obama NA NA 1031 681 0 0 0 0 0 0
2017-Trump NA NA 713 457 0 0 0 0 0 0
2021-Biden NA NA 1127 628 0 0 0 0 0 0
2025-Trump NA NA 1448 842 0 0 0 0 0 0

11 A Zipf-szabály

A Zipf-szabály egy statisztikai törvény, amely legérthetőbb megfogalmazása szerint egy adott rendszerben (például egy nyelvben) az elemek gyakorisága és a gyakorisági rangsorban elfoglalt helyük között fordított arányosság áll fenn. A szabály azt mutatja meg, hogy a világban sok látszólag kaotikus rendszer (mint az internetes forgalom, vagy a jövedelmek eloszlása) valójában egy szigorú matematikai mintát követ.

Az elméleti Zipf-szabály szemléltetése
Az elméleti Zipf-szabály szemléltetése

11.1 A Zipf-szabály táblázatban

#Install the zipfR package
#install.packages("zipfR")

#Load the package
library(zipfR)

#Load necessary libraries
library(ggplot2)

#Define parameters
N <- 100   # Total number of elements
s <- 1.5   # Shape parameter

#Generate Zipf distribution probabilities
zipf_probs <- (1 / (1:N)^s) / sum(1 / (1:N)^s)
zipf_data <- data.frame(Rank = 1:N, Probability = zipf_probs)

#Display the first few rows
print(zipf_data)
##     Rank  Probability
## 1      1 0.4144435056
## 2      2 0.1465279066
## 3      3 0.0797596898
## 4      4 0.0518054382
## 5      5 0.0370689541
## 6      6 0.0281993088
## 7      7 0.0223778459
## 8      8 0.0183159883
## 9      9 0.0153497595
## 10    10 0.0131058544
## 11    11 0.0113599471
## 12    12 0.0099699612
## 13    13 0.0088419959
## 14    14 0.0079117633
## 15    15 0.0071339235
## 16    16 0.0064756798
## 17    17 0.0059127832
## 18    18 0.0054269595
## 19    19 0.0050042032
## 20    20 0.0046336193
## 21    21 0.0043066184
## 22    22 0.0040163478
## 23    23 0.0037572802
## 24    24 0.0035249136
## 25    25 0.0033155480
## 26    26 0.0031261176
## 27    27 0.0029540626
## 28    28 0.0027972307
## 29    29 0.0026538009
## 30    30 0.0025222229
## 31    31 0.0024011694
## 32    32 0.0022894985
## 33    33 0.0021862228
## 34    34 0.0020904846
## 35    35 0.0020015354
## 36    36 0.0019187199
## 37    37 0.0018414620
## 38    38 0.0017692530
## 39    39 0.0017016429
## 40    40 0.0016382318
## 41    41 0.0015786634
## 42    42 0.0015226196
## 43    43 0.0014698149
## 44    44 0.0014199934
## 45    45 0.0013729242
## 46    46 0.0013283992
## 47    47 0.0012862298
## 48    48 0.0012462452
## 49    49 0.0012082901
## 50    50 0.0011722233
## 51    51 0.0011379157
## 52    52 0.0011052495
## 53    53 0.0010741169
## 54    54 0.0010444188
## 55    55 0.0010160646
## 56    56 0.0009889704
## 57    57 0.0009630594
## 58    58 0.0009382603
## 59    59 0.0009145076
## 60    60 0.0008917404
## 61    61 0.0008699025
## 62    62 0.0008489416
## 63    63 0.0008288091
## 64    64 0.0008094600
## 65    65 0.0007908522
## 66    66 0.0007729465
## 67    67 0.0007557065
## 68    68 0.0007390979
## 69    69 0.0007230889
## 70    70 0.0007076496
## 71    71 0.0006927520
## 72    72 0.0006783699
## 73    73 0.0006644787
## 74    74 0.0006510551
## 75    75 0.0006380775
## 76    76 0.0006255254
## 77    77 0.0006133795
## 78    78 0.0006016216
## 79    79 0.0005902347
## 80    80 0.0005792024
## 81    81 0.0005685096
## 82    82 0.0005581418
## 83    83 0.0005480854
## 84    84 0.0005383273
## 85    85 0.0005288554
## 86    86 0.0005196581
## 87    87 0.0005107242
## 88    88 0.0005020435
## 89    89 0.0004936059
## 90    90 0.0004854020
## 91    91 0.0004774229
## 92    92 0.0004696600
## 93    93 0.0004621053
## 94    94 0.0004547509
## 95    95 0.0004475895
## 96    96 0.0004406142
## 97    97 0.0004338182
## 98    98 0.0004271951
## 99    99 0.0004207388
## 100  100 0.0004144435

11.2 A Zipf-szabály mint hatványfüggvény

#Basic Zipf distribution plot
ggplot(zipf_data, aes(x = Rank, y = Probability)) +
  geom_line(color = "brown", size = .75) +
  labs(title = "Basic Zipf Distribution",
       x = "Rank",
       y = "Probability") +
  theme_minimal()

11.3 A Zipf-szabály log-log skálán

#Log10 Zipf distribution plot
ggplot(zipf_data, aes(x = Rank, y = Probability)) +
  geom_line(color = "brown", size = .75) +
  scale_x_log10() +
  scale_y_log10() +
  labs(title = "Log/log Scale Zipf Distribution",
       x = "Rank",
       y = "Probability") +
  theme_minimal()

11.4 A Zipf-szabály Jane Austennél

#— Letöltés & corpus pipeline —

library(gutenbergr)
austen_muvek <- gutenberg_download(
  c(1342, 161, 141, 105),
  meta_fields = "title"
)

austen_corpus <- austen_muvek %>%
  group_by(title) %>%
  summarise(text = paste(text, collapse = " ")) %>%
  corpus(text_field = "text", docid_field = "title")

austen_dfm <- austen_corpus %>%
  tokens(remove_punct = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(stopwords("en")) %>%
  dfm()

11.4.1 Táblázatban

#--- Top szavak kinyerése ---
top_words <- topfeatures(austen_dfm, n = 30, groups = docnames(austen_dfm))

muvek_lista <- lapply(names(top_words), function(mu) {
  data.frame(
    Szó        = names(top_words[[mu]]),
    Gyakoriság = unname(top_words[[mu]])
  )
})
tabla_egyutt <- do.call(cbind, muvek_lista)
rownames(tabla_egyutt) <- 1:30

#Fejléc: rövidített műcímek (helytakarékosság)
cimek <- c("Pride & Prejudice", "Sense & Sensibility", "Mansfield Park", "Persuasion")
col_nevek <- rep(c("Szó", "N"), 4)

tabla_egyutt %>%
  kable("html",
        caption  = "Top 30 szó – 4 Austen-mű",
        col.names = col_nevek) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE) %>%
  add_header_above(setNames(rep(2, 4), cimek)) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#2c7bb6")
Top 30 szó – 4 Austen-mű
Pride & Prejudice
Sense & Sensibility
Mansfield Park
Persuasion
Szó N Szó N Szó N Szó N
fanny 814 anne 446 mr 807 elinor 619
must 498 captain 302 elizabeth 605 mrs 529
crawford 492 mrs 291 said 406 marianne 489
mr 481 mr 254 darcy 383 said 397
much 456 elliot 254 mrs 353 every 375
miss 431 one 230 much 335 one 304
mrs 408 must 227 miss 315 much 290
said 406 lady 214 must 312 must 278
sir 372 much 205 bennet 309 time 237
one 365 wentworth 191 one 284 know 231
edmund 361 little 176 jane 273 dashwood 229
think 349 said 173 bingley 262 now 223
now 332 good 169 know 241 well 217
might 324 might 165 though 238 edward 217
little 309 now 157 never 228 sister 216
time 306 never 154 think 222 might 215
well 298 charles 153 can 218 though 215
nothing 298 time 151 well 217 miss 209
thomas 289 sir 149 soon 216 think 208
good 279 think 149 now 208 mother 203
never 278 well 145 might 206 can 199
without 269 nothing 138 may 203 jennings 199
know 251 great 128 time 200 never 186
can 244 know 126 lady 190 nothing 183
bertram 240 miss 125 little 190 thing 183
first 237 see 123 nothing 185 soon 179
soon 222 man 123 every 185 mr 178
see 219 walter 123 without 177 willoughby 178
though 216 soon 122 sister 177 without 173
every 214 mary 121 good 174 may 172

11.4.2 Grafikonon

szinek <- c(
  "Pride and Prejudice"   = "#e74c3c",
  "Sense and Sensibility" = "#2980b9",
  "Mansfield Park"        = "#27ae60",
  "Persuasion"            = "#8e44ad"
)

freq_muvek <- textstat_frequency(austen_dfm, groups = docnames(austen_dfm)) %>%
  rename(mu = group)

plot_ly() %>%
  {
    p <- .
    for (m in names(szinek)) {
      df_m <- filter(freq_muvek, mu == m)
      p <- add_trace(p,
        data = df_m,
        x    = ~rank,
        y    = ~frequency,
        type = "scatter", mode = "markers",
        name = m,
        marker = list(size = 3, opacity = 0.5,
                      color = szinek[[m]]),
        hovertemplate = paste0("<b>", m, "</b><br>",
                               "szó: %{text}<br>",
                               "rang: %{x}<br>",
                               "előfordulás: %{y}<extra></extra>"),
        text = ~feature
      )
    }
    p
  } %>%
  plotly::layout(
    title    = "Zipf-eloszlás (eredeti skála) – 4 Austen-mű",
    xaxis    = list(title = "Rang"),
    yaxis    = list(title = "Gyakoriság (előfordulások száma)"),
    legend   = list(title = list(text = "<b>Mű</b>")),
    template = "plotly_white"
  )

11.5 A Zipf-szabály és az evangéliumok

11.5.1 Táblázatban

#Get the data
library(readxl) 
#Make frequency tables 
library(tidyverse)

evm <- read_xlsx("Mt_short.xlsx")
freqtab1 <- evm %>% count(FullWord, sort=TRUE)
top50Mt <- freqtab1[1:50,]
Mt_total <- sum(freqtab1$n)
#
evm <- read_xlsx("Mk_short.xlsx")
freqtab2 <- evm %>% count(FullWord, sort=TRUE)
top50Mk <- freqtab2[1:50,]
Mk_total <- sum(freqtab2$n)
#
evm <- read_xlsx("Lk_short.xlsx")
freqtab3 <- evm %>% count(FullWord, sort=TRUE)
top50Lk <- freqtab3[1:50,]
Lk_total <- sum(freqtab3$n)
#
evm <- read_xlsx("Jn_short.xlsx")
freqtab4 <- evm %>% count(FullWord, sort=TRUE)
top50Jn <- freqtab4[1:50,]
Jn_total <- sum(freqtab4$n)
#
evmtab50 <- cbind(top50Mt,top50Mk,top50Lk,top50Jn)
names(evmtab50) <- c("Szó(Mt)","n","Szó(Mk)","n","Szó(Lk)","n","Szó(Jn)","n")
evmtab50

11.5.2 Grafikonon

#install.packages("plotly")
library(plotly)
datus <- data.frame(Roll_number = 1:50, 
                          y1 = top50Mt$n,
                          y2 = top50Mk$n,
                          y3 = top50Lk$n,
                          y4 = top50Jn$n)
#
fig <-plotly::plot_ly(data = datus, x = ~Roll_number,
                      y = ~y1, name = "Mt",
                      type = "scatter",mode = "lines") %>%
  add_trace(y = ~y2, name = "Mk") %>% 
  add_trace(y = ~y4, name = "Jn") %>%
  add_trace(y = ~y3, name = "Lk") %>% 
  layout(title = 'Zipfs law and the gospels', xaxis = list(title = 'Helyezés'),
         yaxis = list(title = 'Előfordulás'), legend = list(title=list(text='Legend Title')))
  
fig

11.6 The largest cities by population in the world / A világ legnépesebb városai

11.6.1 In datatable


``` r
library(readr)
bigs <- read_csv("largest-cities-by-population-2026.csv")

library(reactable)

reactable(bigs,
  searchable = TRUE,        # keresőmező
  filterable = TRUE,        # oszloponkénti szűrő
  striped = TRUE,           # csíkos sorok
  highlight = TRUE,         # hover kiemelés
  defaultPageSize = 25,     # sorok oldalanként
  height = 500              # görgethető magasság
)

11.6.2 On graph

library(plotly)
library(dplyr)
bigs <- read_csv("largest-cities-by-population-2026.csv")

fig <- plot_ly(data = bigs, x = ~rank, y = ~pop2026,
                      text = ~city,
                      name = "Biggest cities of the world",
                      type = "scatter",mode = "lines")

fig

12 Quick-Reference Cheat Sheet / Gyorsreferencia táblázat

Function Input What it computes
textstat_frequency() DFM Term frequencies & document frequencies
textstat_lexdiv() Tokens Lexical diversity (TTR, MATTR, MTLD, …)
textstat_readability() Corpus Readability indices (Flesch, FOG, SMOG, …)
textstat_dist() DFM Pairwise document/feature distances
textstat_simil() DFM Pairwise document/feature similarities
textstat_keyness() DFM Keyness of terms in target vs reference
textstat_collocations() Tokens Multi-word collocations (λ, z-scores)
textstat_entropy() DFM Shannon entropy per doc or feature
textstat_summary() Corpus/Tokens/DFM Token, type & sentence counts per doc