quanteda.textstats is a companion package to quanteda that provides statistical analysis functions for text objects (corpora, tokens, and document- feature matrices). This document walks through each major function with reproducible examples using the built-in data_corpus_inaugural dataset. At the end of the document, I also introduce a few other R packages related to lexicology.
A quanteda.textstats a quanteda kiegészítő csomagja, amely statisztikai elemzési függvényeket biztosít szöveges objektumokhoz (korpuszokhoz, tokenekhez és dokumentum-jellemző mátrixokhoz). Ez a dokumentum a beépített data_corpus_inaugural adatbázist felhasználva, reprodukálható példákkal mutatja be az egyes főbb függvényeket. A dokumentum végén még néhány egyéb, lexikológia tárgykörhöz készült R csomagot is bemutatok.

1 Setup & Data Preparation / Beállítás és adatok előkészítése

# Install packages if needed:
# install.packages(c("quanteda", "quanteda.textstats",
#                    "quanteda.textplots", "ggplot2", "dplyr"))

library(quanteda)
library(quanteda.textstats)
library(quanteda.textplots)
library(ggplot2)
library(dplyr)
library(paletteer)
library(plotly)
library(kableExtra)
library(textstem)
tokens <- quanteda::tokens   # pin tokens() to quanteda for this session

# ── Build a DFM from US inaugural speeches ─────────
corp <- data_corpus_inaugural 
toks <- corp |>
  quanteda::tokens(remove_punct = TRUE,
      remove_symbols = TRUE,
      remove_numbers = TRUE) |>
  quanteda::tokens_remove(pattern = stopwords("en"))

dfm_all <- dfm(toks)

# Subset: last 15 speeches for compact comparisons
corp15   <- corpus_subset(corp, Year >= 1965)

toks15 <- corp15 |>
  quanteda::tokens(remove_punct  = TRUE,
      remove_symbols = TRUE,
      remove_numbers = TRUE) |>
  quanteda::tokens_remove(pattern = stopwords("en"))

dfm15    <- dfm(toks15)

cat("Full DFM:", ndoc(dfm_all), "docs ×", nfeat(dfm_all), "features\n")

## Full DFM: 60 docs × 9359 features

cat("Subset DFM:", ndoc(dfm15),  "docs ×", nfeat(dfm15),  "features\n")

## Subset DFM: 16 docs × 3993 features

2 `textstat_frequency()` — Term Frequency / Kifejezés gyakorisága

textstat_frequency() returns term frequencies across the whole corpus or within groups. Ez a függvény szógyakoriságokat ad vissza az egész korpuszban, vagy annak csoportjaiban. The docfreq column shows the number of documents in which the word appears at least once. A docfreq oszlop azon dokumentumok számát mutatja, amelyekben a szó legalább egyszer előfordul.

# Top 20 terms overall
freq <- textstat_frequency(dfm_all, n = 20)

knitr::kable(freq[, c("feature","frequency","rank","docfreq")],
             caption = "Top 20 Stems — All Inaugural Speeches",
             align   = "lrrrr")

Top 20 Stems — All Inaugural Speeches
feature	frequency	rank	docfreq
people	592	1	58
government	575	2	53
us	507	3	57
can	489	4	57
must	377	5	53
upon	371	6	47
great	354	7	57
may	343	8	54
states	343	8	48
world	329	10	54
nation	324	11	55
country	323	12	55
shall	316	13	51
every	309	14	53
one	274	15	50
peace	259	16	48
new	256	17	51
power	246	18	49
now	238	19	54
public	227	20	43

freq20 <- textstat_frequency(dfm_all, n = 20)

ggplot(freq20, aes(x = reorder(feature, frequency), y = frequency)) +
  geom_col(fill = "#2c7bb6", alpha = .85) +
  coord_flip() +
  labs(title    = "Top 20 Terms — US Inaugural Addresses",
       subtitle = "After stopword removal and stemming",
       x = NULL, y = "Total Frequency") +
  theme_minimal(base_size = 13)

# Frequency within groups (20th vs 21st century)
docvars(dfm_all, "century") <- ifelse(docvars(dfm_all, "Year") >= 2000,
                                       "21st", "20th")
freq_grp <- textstat_frequency(dfm_all, n = 10, groups = century)

ggplot(freq_grp, aes(x = reorder(feature, frequency), y = frequency,
                     fill = group)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~group, scales = "free_y") +
  coord_flip() +
  scale_fill_manual(values = c("20th" = "#d7191c", "21st" = "#1a9641")) +
  labs(title = "Top Terms by Century", x = NULL, y = "Frequency") +
  theme_minimal(base_size = 12)

2.1 Textstat-freqency - The President … / Az elnök

2.2 … and his favourite words / … és kedvenc szavai

3 `textstat_lexdiv()` — Lexical Diversity / Szókincs-változatosság

Measures how rich and varied the vocabulary is for each document. Several indices are available: TTR, MATTR, MTLD, MSTTR, etc.

TTR (Type-Token Ratio - Szófaj-Szó Arány)

Azt mutatja meg, hogy az adott szövegben mekkora az egyedi szavak aránya az összes szóhoz képest. Hátránya: nagyon érzékeny a szöveghosszra, így nem használható közvetlenül különböző hosszúságú szövegek összehasonlítására.

MATTR (Moving-Average Type-Token Ratio - Mozgóátlagos TTR)

Ez egy modernebb, a szöveghosszra kevésbé érzékeny mérőszám. Előnye: alkalmas a különböző hosszúságú szövegek szókincsének összehasonlítására anélkül, hogy a szöveg hossza törzítaná az eredményt.

ld <- textstat_lexdiv(toks15,
                      measure = c("TTR", "MATTR", "MTLD"))

# Attach metadata and drop the 'document' column for display
ld$President <- docvars(corp15, "President")
ld$Year      <- docvars(corp15, "Year")

# Select only columns that actually exist
show_cols <- intersect(c("Year", "President", "TTR", "MATTR", "MTLD"), names(ld))

knitr::kable(ld[, show_cols],
             digits  = 3,
             caption = "Lexical Diversity — Speeches Since 1965")

Lexical Diversity — Speeches Since 1965
Year	President	TTR	MATTR
1965	Johnson	0.637	0.853
1969	Nixon	0.604	0.834
1973	Nixon	0.495	0.743
1977	Carter	0.694	0.867
1981	Reagan	0.646	0.867
1985	Reagan	0.587	0.857
1989	Bush	0.589	0.846
1993	Clinton	0.648	0.856
1997	Clinton	0.560	0.836
2001	Bush	0.642	0.886
2005	Bush	0.618	0.862
2009	Obama	0.683	0.912
2013	Obama	0.661	0.902
2017	Trump	0.641	0.834
2021	Biden	0.557	0.843
2025	Trump	0.581	0.874

# Pivot only the measure columns that exist
measure_cols <- intersect(c("TTR", "MATTR", "MTLD"), names(ld))

ld_long <- tidyr::pivot_longer(ld,
                                cols      = all_of(measure_cols),
                                names_to  = "Measure",
                                values_to = "Score")

ggplot(ld_long, aes(x = factor(Year), y = Score,
                     group = Measure, colour = Measure)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2.5) +
  facet_wrap(~Measure, scales = "free_y", ncol = 1) +
  labs(title = "Lexical Diversity Over Time",
       x = "Year", y = "Score") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))

4 `textstat_readability()` — Readability Scores / Olvashatósági mutatók

Quantifies how easy or difficult each speech is to read using classic formulas such as Flesch, Flesch-Kincaid, Gunning Fog, etc.

# Works on a corpus (needs sentence structure)
corp15_nostop <- corpus_subset(data_corpus_inaugural, Year >= 1965)

rd <- textstat_readability(corp15_nostop,
                           measure = c("Flesch",
                                       "Flesch.Kincaid",
                                       "FOG",
                                       "SMOG"))
rd$President <- docvars(corp15_nostop, "President")
rd$Year      <- docvars(corp15_nostop, "Year")

knitr::kable(rd[, c("Year","President","Flesch","Flesch.Kincaid","FOG","SMOG")],
             digits  = 2,
             caption = "Readability Scores — Speeches Since 1965")

Readability Scores — Speeches Since 1965
Year	President	Flesch	Flesch.Kincaid	FOG	SMOG
1965	Johnson	69.41	7.56	10.41	10.36
1969	Nixon	65.58	9.24	12.05	11.13
1973	Nixon	54.19	12.30	15.20	13.10
1977	Carter	53.38	11.67	14.55	13.06
1981	Reagan	58.75	9.76	12.92	12.22
1985	Reagan	57.58	10.42	13.48	12.47
1989	Bush	73.10	7.15	9.98	9.88
1993	Clinton	55.81	10.38	13.20	12.37
1997	Clinton	59.22	9.83	12.69	11.96
2001	Bush	60.12	8.93	11.63	11.37
2005	Bush	53.19	11.04	14.11	13.02
2009	Obama	60.53	10.23	12.71	11.55
2013	Obama	53.56	11.73	14.51	12.95
2017	Trump	58.58	9.17	12.16	11.78
2021	Biden	73.20	5.78	8.74	9.37
2025	Trump	55.08	9.67	12.64	12.15

ggplot(rd, aes(x = Year, y = Flesch, label = President)) +
  geom_smooth(method = "loess", se = TRUE,
              colour = "#f46d43", fill = "#fee090", linewidth = 1) +
  geom_point(colour = "#4393c3", size = 3) +
  ggrepel::geom_text_repel(size = 3, max.overlaps = 6) +
  labs(title    = "Flesch Reading Ease Over Time",
       subtitle = "Higher = easier to read",
       x = "Year", y = "Flesch Score") +
  theme_minimal(base_size = 13)

5 `textstat_dist()` — Document Distance / Dokumentum távolság

Computes pairwise distances between documents (or features) in the DFM. Supports Euclidean, Manhattan, cosine, etc.

dist_mat <- textstat_dist(dfm15, method = "euclidean")

# Build unique labels: "Year President" to avoid duplicate name clashes
labels15 <- paste(docvars(corp15, "Year"), docvars(corp15, "President"))

dist_df <- as.matrix(dist_mat)
rownames(dist_df) <- labels15
colnames(dist_df) <- labels15

# Heatmap via ggplot
dist_long <- dist_df |>
  as.data.frame() |>
  tibble::rownames_to_column("Doc1") |>
  tidyr::pivot_longer(-Doc1, names_to = "Doc2", values_to = "Distance")

ggplot(dist_long, aes(x = Doc1, y = Doc2, fill = Distance)) +
  geom_tile() +
  scale_fill_distiller(palette = "RdYlBu", direction = -1) +
  labs(title = "Euclidean Distance Between Speeches",
       x = NULL, y = NULL) +
  theme_minimal(base_size = 10) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

6 `textstat_simil()` — Document Similarity / Dokumentum hasonlóság

The counterpart to textstat_dist(): higher values mean more similar documents. Cosine similarity is the most popular choice for text.

sim_mat <- textstat_simil(dfm15, method = "cosine")

# Same unique labels as above
sim_df  <- as.matrix(sim_mat)
rownames(sim_df) <- labels15
colnames(sim_df) <- labels15

sim_long <- sim_df |>
  as.data.frame() |>
  tibble::rownames_to_column("Doc1") |>
  tidyr::pivot_longer(-Doc1, names_to = "Doc2", values_to = "Similarity")

ggplot(sim_long, aes(x = Doc1, y = Doc2, fill = Similarity)) +
  geom_tile() +
  scale_fill_distiller(palette = "YlGn", direction = 1) +
  labs(title = "Cosine Similarity Between Speeches",
       x = NULL, y = NULL) +
  theme_minimal(base_size = 10) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

7 `textstat_keyness()` — Keyness Analysis / Kulcsszó-elemzés

Identifies terms that are significantly more (or less) frequent in a target set compared to a reference set. Uses chi-squared or log-likelihood tests.

# Target: 21st-century speeches; Reference: 20th-century speeches
dfm_cent <- dfm_group(dfm_all, groups = century)

key <- textstat_keyness(dfm_cent, target = "21st")

head(key, 15) |>
  knitr::kable(digits  = 3,
               caption = "Keyness: 21st-Century vs 20th-Century Speeches")

Keyness: 21st-Century vs 20th-Century Speeches
feature	chi2	n_target	n_reference
america	192.347	88	132
thank	187.142	34	11
americans	104.360	40	50
story	100.626	18	5
jobs	73.151	12	2
today	53.647	37	78
america’s	50.472	18	19
borders	44.372	9	3
nation	42.932	72	252
day	42.776	30	64
workers	40.756	7	1
raging	40.288	6	0
american	40.158	47	138
back	39.316	18	25
president	38.315	31	73

# quanteda.textplots provides a dedicated keyness plot
if (requireNamespace("quanteda.textplots", quietly = TRUE)) {
  quanteda.textplots::textplot_keyness(key,
                                       n      = 15,
                                       labelcolor = "grey30") +
    labs(title = "Keyness Plot — 21st vs 20th Century") +
    theme_minimal(base_size = 12)
} else {
  # Fallback ggplot version
  key_top <- rbind(head(key, 10), tail(key, 10))
  key_top$direction <- ifelse(key_top$chi2 > 0, "21st century", "20th century")

  ggplot(key_top, aes(x = reorder(feature, chi2), y = chi2, fill = direction)) +
    geom_col() +
    coord_flip() +
    scale_fill_manual(values = c("21st century" = "#1a9641",
                                 "20th century" = "#d7191c")) +
    labs(title = "Keyness: Top Distinctive Terms",
         x = NULL, y = "Chi-squared statistic", fill = NULL) +
    theme_minimal(base_size = 12)
}

8 `textstat_collocations()` — Collocations / Szavak egybeesései

Finds multi-word expressions that appear together more often than chance. Useful for discovering idioms, named entities, and technical phrases.

# Run on the tokens object (before stopword removal for natural phrases)
toks_raw <- tokens(corp, remove_punct = TRUE)

col <- textstat_collocations(toks_raw,
                             size    = 2,        # bigrams
                             min_count = 5)      # at least 5 occurrences

head(col, 20) |>
  knitr::kable(digits  = 3,
               caption = "Top Bigram Collocations (λ statistic)")

Top Bigram Collocations (λ statistic)
collocation	count	length	lambda	z
of the	1786	2	1.563	53.069
it is	327	2	3.541	51.057
has been	188	2	5.200	50.397
have been	209	2	4.758	49.254
those who	130	2	5.817	45.564
we have	270	2	3.371	45.065
united states	165	2	7.935	43.578
of our	635	2	2.029	41.981
will be	220	2	3.337	41.329
in the	828	2	1.709	40.143
let us	101	2	6.342	37.517
should be	140	2	4.301	37.515
we are	187	2	3.127	36.252
we will	202	2	2.971	36.225
may be	126	2	4.043	35.165
fellow citizens	79	2	7.822	34.764
i shall	96	2	4.323	34.030
we must	128	2	3.688	33.173
must be	117	2	3.782	33.094
there is	104	2	4.150	32.331

col3 <- textstat_collocations(toks_raw, size = 3, min_count = 3)
head(col3, 10) |>
  knitr::kable(digits  = 3,
               caption = "Top Trigram Collocations")

Top Trigram Collocations
collocation	count	length	lambda	z
of which the	11	3	3.065	7.970
all of us	15	3	4.455	7.394
in which the	14	3	2.486	7.161
than that of	8	3	4.966	6.965
is not the	15	3	2.523	6.733
is that of	5	3	3.498	6.448
as that of	4	3	4.885	6.393
the american people	40	3	5.651	6.307
of president of	6	3	4.783	6.269
to that of	4	3	3.389	6.222

# 1. Előkészít#és
toks <- tokens(data_corpus_inaugural, remove_punct = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(stopwords("english"))

# 2. Co-occurrence mátrix (FCM) létrehozása 5 szavas ablakkal
fcmat <- fcm(toks, context = "window", window = 5)

# 3. Csak a 50 leggyakoribb szó kiválasztása a jobb olvashatóságért
# DFM létrehozása és a leggyakoribb szavak kigyűjtése
dfmat <- dfm(toks)
feat <- names(topfeatures(dfmat, 50)) # Itt még működik a topfeatures

# Az FCM szűrése a DFM alapján kapott szavakra
fcm_subset <- fcm_select(fcmat, pattern = feat)

# 4. Hálózati diagram kirajzolása
library(ggplot2)
textplot_network(fcm_subset, 
                 min_freq = 0.9, 
                 vertex_labelsize = 5,
                 vertex_color = "#E41A1C",    # Pirosas csomópontok
                 edge_color = "#377EB8",      # Kékes élek
                 edge_alpha = 0.4,            # Halvány élek a jobb olvashatóságért
                 vertex_size = colSums(fcm_subset)/max(colSums(fcm_subset)) * 5) +
                    labs(title = "Szókapcsolatok hálózata az elnöki beszédekben",
                    hjust = 0.5,
                    subtitle = "Az 50 leggyakoribb szó alapján ('stopwords' nélkül)",
                    caption = "Forrás: data_corpus_inaugural")

9 `textstat_entropy()` — Shannon entropy / Shannon-féle entrópia

Shannon entropy measures the diversity of term usage across documents: high entropy → terms spread evenly; low entropy → concentrated in few docs. A szavak entrópiájának vizsgálata: egy szó, amely csak egy dokumentumban fordul elő, alacsony entrópiával rendelkezik, míg egy olyan szó, amely sok dokumentumban egyenletesen eloszlik, magas entrópiával bír és csak néhány dokumentumban koncentrálódik.

A Shannon-entrópia egy vélhetően bekövetkező esemény bizonytalanságát méri. A képlete a következő:

\[H(X) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i)\]

Ahol: - H(X) az entrópia, - P(x_i) az x_i esemény valószínűsége. A függvény akkor éri el a maximumát, amikor a rendszer a lehető legkevésbé jósolható előre, azaz egyenletes az eloszlás. Ez azt jelenti, hogy minden lehetséges kimenetel pontosan ugyanakkora valószínűséggel következik be. Képletben: p1 = p2 = p3 = … = pn = 1/n A maximum értékéhez a feltételt a képletbe helyettesítve: \[H_{max} = \log_{2}(n) \text{ bit}\] A Shannon-entrópia értéke mindig nem negatív, és a rendszer bizonytalanságának mértékét jelzi. Minél magasabb az entrópia, annál nagyobb a bizonytalanság, és annál több információt tartalmaz a rendszer. A legegyszerűbb példa az érmefeldobás: egy tisztességes érme esetén két kimenetel van (fej vagy írás), mindkettő valószínűsége 0,5. Ebben az esetben a Shannon-entrópia értéke: \[H(X) = - (0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1 \text{ bit}\] Ez azt jelenti, hogy egy tisztességes érme feldobása 1 bit információt tartalmaz, és a kimenetel teljesen bizonytalan. Ha az érme csaló lenne, és mindig fej lenne, akkor a Shannon-entrópia értéke 0 lenne, mivel nincs bizonytalanság: \[H(X) = - (1 \log_2 1 + 0 \log_2 0) = 0 \text{ bit}\]

ent <- textstat_entropy(dfm_all, margin = "features")

ent_top <- ent |>
  arrange(desc(entropy)) |>
  head(20)

ggplot(ent_top, aes(x = reorder(feature, entropy), y = entropy)) +
  geom_col(fill = "#7b3294", alpha = .8) +
  coord_flip() +
  labs(title    = "Top 20 Features by Entropy",
       subtitle = "High entropy = term used across many documents evenly",
       x = NULL, y = "Shannon Entropy") +
  theme_minimal(base_size = 12)

# Entropy across features within each document
ent_doc <- textstat_entropy(dfm15, margin = "documents")
ent_doc$President <- docvars(corp15, "President")

ggplot(ent_doc, aes(x = reorder(President, entropy), y = entropy)) +
  geom_point(colour = "#e66101", size = 4) +
  geom_segment(aes(xend = reorder(President, entropy), yend = 0),
               colour = "#e66101", linewidth = .8) +
  coord_flip() +
  labs(title = "Document-Level Entropy",
       x = NULL, y = "Entropy") +
  theme_minimal(base_size = 12)

10 `textstat_summary()` — Corpus-Tokens-DFM Summary / Corpus-Tokens-DFM összefoglaló

A quick diagnostic function returning token counts, type counts, sentences, and other metadata for each document.

summ_corp <- textstat_summary(corp15)
knitr::kable(summ_corp,
             caption = "Corpus Summary — Speeches Since 1965",
             digits  = 1)

Corpus Summary — Speeches Since 1965
document	chars	sents	tokens	types	puncts	numbers
1965-Johnson	8205	93	1710	535	221	3
1969-Nixon	11644	103	2416	714	292	0
1973-Nixon	10007	68	1995	515	193	1
1977-Carter	6878	52	1370	501	146	3
1981-Reagan	13743	129	2781	850	349	1
1985-Reagan	14572	123	2909	876	345	11
1989-Bush	12529	141	2674	756	357	2
1993-Clinton	9113	81	1833	605	235	0
1997-Clinton	12262	111	2436	726	279	0
2001-Bush	9054	97	1806	592	222	1
2005-Bush	11923	99	2312	734	241	0
2009-Obama	13460	110	2689	900	299	0
2013-Obama	11917	88	2317	786	220	5
2017-Trump	8433	88	1660	547	215	2
2021-Biden	13133	216	2766	744	394	6
2025-Trump	17077	177	3347	950	434	4

summ_tok <- textstat_summary(toks15)
knitr::kable(summ_tok,
             caption = "Tokens Summary (after preprocessing)",
             digits  = 1)

Tokens Summary (after preprocessing)
document	chars	sents	tokens	types
1965-Johnson	NA	NA	691	440
1969-Nixon	NA	NA	1028	621
1973-Nixon	NA	NA	851	421
1977-Carter	NA	NA	592	411
1981-Reagan	NA	NA	1146	740
1985-Reagan	NA	NA	1291	758
1989-Bush	NA	NA	1092	643
1993-Clinton	NA	NA	798	517
1997-Clinton	NA	NA	1130	633
2001-Bush	NA	NA	783	503
2005-Bush	NA	NA	1041	643
2009-Obama	NA	NA	1173	801
2013-Obama	NA	NA	1031	681
2017-Trump	NA	NA	713	457
2021-Biden	NA	NA	1127	628
2025-Trump	NA	NA	1448	842

11 A Zipf-szabály

A Zipf-szabály egy statisztikai törvény, amely legérthetőbb megfogalmazása szerint egy adott rendszerben (például egy nyelvben) az elemek gyakorisága és a gyakorisági rangsorban elfoglalt helyük között fordított arányosság áll fenn. A szabály azt mutatja meg, hogy a világban sok látszólag kaotikus rendszer (mint az internetes forgalom, vagy a jövedelmek eloszlása) valójában egy szigorú matematikai mintát követ.

Az elméleti Zipf-szabály szemléltetése

11.1 A Zipf-szabály táblázatban

#Install the zipfR package
#install.packages("zipfR")

#Load the package
library(zipfR)

#Load necessary libraries
library(ggplot2)

#Define parameters
N <- 100   # Total number of elements
s <- 1.5   # Shape parameter

#Generate Zipf distribution probabilities
zipf_probs <- (1 / (1:N)^s) / sum(1 / (1:N)^s)
zipf_data <- data.frame(Rank = 1:N, Probability = zipf_probs)

#Display the first few rows
print(zipf_data)

##     Rank  Probability
## 1      1 0.4144435056
## 2      2 0.1465279066
## 3      3 0.0797596898
## 4      4 0.0518054382
## 5      5 0.0370689541
## 6      6 0.0281993088
## 7      7 0.0223778459
## 8      8 0.0183159883
## 9      9 0.0153497595
## 10    10 0.0131058544
## 11    11 0.0113599471
## 12    12 0.0099699612
## 13    13 0.0088419959
## 14    14 0.0079117633
## 15    15 0.0071339235
## 16    16 0.0064756798
## 17    17 0.0059127832
## 18    18 0.0054269595
## 19    19 0.0050042032
## 20    20 0.0046336193
## 21    21 0.0043066184
## 22    22 0.0040163478
## 23    23 0.0037572802
## 24    24 0.0035249136
## 25    25 0.0033155480
## 26    26 0.0031261176
## 27    27 0.0029540626
## 28    28 0.0027972307
## 29    29 0.0026538009
## 30    30 0.0025222229
## 31    31 0.0024011694
## 32    32 0.0022894985
## 33    33 0.0021862228
## 34    34 0.0020904846
## 35    35 0.0020015354
## 36    36 0.0019187199
## 37    37 0.0018414620
## 38    38 0.0017692530
## 39    39 0.0017016429
## 40    40 0.0016382318
## 41    41 0.0015786634
## 42    42 0.0015226196
## 43    43 0.0014698149
## 44    44 0.0014199934
## 45    45 0.0013729242
## 46    46 0.0013283992
## 47    47 0.0012862298
## 48    48 0.0012462452
## 49    49 0.0012082901
## 50    50 0.0011722233
## 51    51 0.0011379157
## 52    52 0.0011052495
## 53    53 0.0010741169
## 54    54 0.0010444188
## 55    55 0.0010160646
## 56    56 0.0009889704
## 57    57 0.0009630594
## 58    58 0.0009382603
## 59    59 0.0009145076
## 60    60 0.0008917404
## 61    61 0.0008699025
## 62    62 0.0008489416
## 63    63 0.0008288091
## 64    64 0.0008094600
## 65    65 0.0007908522
## 66    66 0.0007729465
## 67    67 0.0007557065
## 68    68 0.0007390979
## 69    69 0.0007230889
## 70    70 0.0007076496
## 71    71 0.0006927520
## 72    72 0.0006783699
## 73    73 0.0006644787
## 74    74 0.0006510551
## 75    75 0.0006380775
## 76    76 0.0006255254
## 77    77 0.0006133795
## 78    78 0.0006016216
## 79    79 0.0005902347
## 80    80 0.0005792024
## 81    81 0.0005685096
## 82    82 0.0005581418
## 83    83 0.0005480854
## 84    84 0.0005383273
## 85    85 0.0005288554
## 86    86 0.0005196581
## 87    87 0.0005107242
## 88    88 0.0005020435
## 89    89 0.0004936059
## 90    90 0.0004854020
## 91    91 0.0004774229
## 92    92 0.0004696600
## 93    93 0.0004621053
## 94    94 0.0004547509
## 95    95 0.0004475895
## 96    96 0.0004406142
## 97    97 0.0004338182
## 98    98 0.0004271951
## 99    99 0.0004207388
## 100  100 0.0004144435

11.2 A Zipf-szabály mint hatványfüggvény

#Basic Zipf distribution plot
ggplot(zipf_data, aes(x = Rank, y = Probability)) +
  geom_line(color = "brown", size = .75) +
  labs(title = "Basic Zipf Distribution",
       x = "Rank",
       y = "Probability") +
  theme_minimal()

11.3 A Zipf-szabály log-log skálán

#Log10 Zipf distribution plot
ggplot(zipf_data, aes(x = Rank, y = Probability)) +
  geom_line(color = "brown", size = .75) +
  scale_x_log10() +
  scale_y_log10() +
  labs(title = "Log/log Scale Zipf Distribution",
       x = "Rank",
       y = "Probability") +
  theme_minimal()

11.4 A Zipf-szabály Jane Austennél

#— Letöltés & corpus pipeline —

library(gutenbergr)
austen_muvek <- gutenberg_download(
  c(1342, 161, 141, 105),
  meta_fields = "title"
)

austen_corpus <- austen_muvek %>%
  group_by(title) %>%
  summarise(text = paste(text, collapse = " ")) %>%
  corpus(text_field = "text", docid_field = "title")

austen_dfm <- austen_corpus %>%
  tokens(remove_punct = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(stopwords("en")) %>%
  dfm()

11.4.1 Táblázatban

#--- Top szavak kinyerése ---
top_words <- topfeatures(austen_dfm, n = 30, groups = docnames(austen_dfm))

muvek_lista <- lapply(names(top_words), function(mu) {
  data.frame(
    Szó        = names(top_words[[mu]]),
    Gyakoriság = unname(top_words[[mu]])
  )
})
tabla_egyutt <- do.call(cbind, muvek_lista)
rownames(tabla_egyutt) <- 1:30

#Fejléc: rövidített műcímek (helytakarékosság)
cimek <- c("Pride & Prejudice", "Sense & Sensibility", "Mansfield Park", "Persuasion")
col_nevek <- rep(c("Szó", "N"), 4)

tabla_egyutt %>%
  kable("html",
        caption  = "Top 30 szó – 4 Austen-mű",
        col.names = col_nevek) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE) %>%
  add_header_above(setNames(rep(2, 4), cimek)) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#2c7bb6")

Top 30 szó – 4 Austen-mű
Pride & Prejudice		Sense & Sensibility		Mansfield Park		Persuasion
Szó	N	Szó	N	Szó	N	Szó	N
fanny	814	anne	446	mr	807	elinor	619
must	498	captain	302	elizabeth	605	mrs	529
crawford	492	mrs	291	said	406	marianne	489
mr	481	mr	254	darcy	383	said	397
much	456	elliot	254	mrs	353	every	375
miss	431	one	230	much	335	one	304
mrs	408	must	227	miss	315	much	290
said	406	lady	214	must	312	must	278
sir	372	much	205	bennet	309	time	237
one	365	wentworth	191	one	284	know	231
edmund	361	little	176	jane	273	dashwood	229
think	349	said	173	bingley	262	now	223
now	332	good	169	know	241	well	217
might	324	might	165	though	238	edward	217
little	309	now	157	never	228	sister	216
time	306	never	154	think	222	might	215
well	298	charles	153	can	218	though	215
nothing	298	time	151	well	217	miss	209
thomas	289	sir	149	soon	216	think	208
good	279	think	149	now	208	mother	203
never	278	well	145	might	206	can	199
without	269	nothing	138	may	203	jennings	199
know	251	great	128	time	200	never	186
can	244	know	126	lady	190	nothing	183
bertram	240	miss	125	little	190	thing	183
first	237	see	123	nothing	185	soon	179
soon	222	man	123	every	185	mr	178
see	219	walter	123	without	177	willoughby	178
though	216	soon	122	sister	177	without	173
every	214	mary	121	good	174	may	172

11.4.2 Grafikonon

szinek <- c(
  "Pride and Prejudice"   = "#e74c3c",
  "Sense and Sensibility" = "#2980b9",
  "Mansfield Park"        = "#27ae60",
  "Persuasion"            = "#8e44ad"
)

freq_muvek <- textstat_frequency(austen_dfm, groups = docnames(austen_dfm)) %>%
  rename(mu = group)

plot_ly() %>%
  {
    p <- .
    for (m in names(szinek)) {
      df_m <- filter(freq_muvek, mu == m)
      p <- add_trace(p,
        data = df_m,
        x    = ~rank,
        y    = ~frequency,
        type = "scatter", mode = "markers",
        name = m,
        marker = list(size = 3, opacity = 0.5,
                      color = szinek[[m]]),
        hovertemplate = paste0("<b>", m, "</b><br>",
                               "szó: %{text}<br>",
                               "rang: %{x}<br>",
                               "előfordulás: %{y}<extra></extra>"),
        text = ~feature
      )
    }
    p
  } %>%
  plotly::layout(
    title    = "Zipf-eloszlás (eredeti skála) – 4 Austen-mű",
    xaxis    = list(title = "Rang"),
    yaxis    = list(title = "Gyakoriság (előfordulások száma)"),
    legend   = list(title = list(text = "<b>Mű</b>")),
    template = "plotly_white"
  )

11.5 A Zipf-szabály és az evangéliumok

11.5.1 Táblázatban

#Get the data
library(readxl) 
#Make frequency tables 
library(tidyverse)

evm <- read_xlsx("Mt_short.xlsx")
freqtab1 <- evm %>% count(FullWord, sort=TRUE)
top50Mt <- freqtab1[1:50,]
Mt_total <- sum(freqtab1$n)
#
evm <- read_xlsx("Mk_short.xlsx")
freqtab2 <- evm %>% count(FullWord, sort=TRUE)
top50Mk <- freqtab2[1:50,]
Mk_total <- sum(freqtab2$n)
#
evm <- read_xlsx("Lk_short.xlsx")
freqtab3 <- evm %>% count(FullWord, sort=TRUE)
top50Lk <- freqtab3[1:50,]
Lk_total <- sum(freqtab3$n)
#
evm <- read_xlsx("Jn_short.xlsx")
freqtab4 <- evm %>% count(FullWord, sort=TRUE)
top50Jn <- freqtab4[1:50,]
Jn_total <- sum(freqtab4$n)
#
evmtab50 <- cbind(top50Mt,top50Mk,top50Lk,top50Jn)
names(evmtab50) <- c("Szó(Mt)","n","Szó(Mk)","n","Szó(Lk)","n","Szó(Jn)","n")
evmtab50

11.5.2 Grafikonon

#install.packages("plotly")
library(plotly)
datus <- data.frame(Roll_number = 1:50, 
                          y1 = top50Mt$n,
                          y2 = top50Mk$n,
                          y3 = top50Lk$n,
                          y4 = top50Jn$n)
#
fig <-plotly::plot_ly(data = datus, x = ~Roll_number,
                      y = ~y1, name = "Mt",
                      type = "scatter",mode = "lines") %>%
  add_trace(y = ~y2, name = "Mk") %>% 
  add_trace(y = ~y4, name = "Jn") %>%
  add_trace(y = ~y3, name = "Lk") %>% 
  layout(title = 'Zipfs law and the gospels', xaxis = list(title = 'Helyezés'),
         yaxis = list(title = 'Előfordulás'), legend = list(title=list(text='Legend Title')))
  
fig

11.6 The largest cities by population in the world / A világ legnépesebb városai

11.6.1 In datatable


``` r
library(readr)
bigs <- read_csv("largest-cities-by-population-2026.csv")

library(reactable)

reactable(bigs,
  searchable = TRUE,        # keresőmező
  filterable = TRUE,        # oszloponkénti szűrő
  striped = TRUE,           # csíkos sorok
  highlight = TRUE,         # hover kiemelés
  defaultPageSize = 25,     # sorok oldalanként
  height = 500              # görgethető magasság
)

11.6.2 On graph

library(plotly)
library(dplyr)
bigs <- read_csv("largest-cities-by-population-2026.csv")

fig <- plot_ly(data = bigs, x = ~rank, y = ~pop2026,
                      text = ~city,
                      name = "Biggest cities of the world",
                      type = "scatter",mode = "lines")

fig

12 Quick-Reference Cheat Sheet / Gyorsreferencia táblázat

Function	Input	What it computes
`textstat_frequency()`	DFM	Term frequencies & document frequencies
`textstat_lexdiv()`	Tokens	Lexical diversity (TTR, MATTR, MTLD, …)
`textstat_readability()`	Corpus	Readability indices (Flesch, FOG, SMOG, …)
`textstat_dist()`	DFM	Pairwise document/feature distances
`textstat_simil()`	DFM	Pairwise document/feature similarities
`textstat_keyness()`	DFM	Keyness of terms in target vs reference
`textstat_collocations()`	Tokens	Multi-word collocations (λ, z-scores)
`textstat_entropy()`	DFM	Shannon entropy per doc or feature
`textstat_summary()`	Corpus/Tokens/DFM	Token, type & sentence counts per doc

What can an R package do? / Mit tud egy R csomag? A showcase / Bemutató dokumentum

WR

2026-05-07

1 Setup & Data Preparation / Beállítás és adatok előkészítése

2 `textstat_frequency()` — Term Frequency / Kifejezés gyakorisága

2.1 Textstat-freqency - The President … / Az elnök

2.2 … and his favourite words / … és kedvenc szavai

3 `textstat_lexdiv()` — Lexical Diversity / Szókincs-változatosság

4 `textstat_readability()` — Readability Scores / Olvashatósági mutatók

5 `textstat_dist()` — Document Distance / Dokumentum távolság

6 `textstat_simil()` — Document Similarity / Dokumentum hasonlóság

7 `textstat_keyness()` — Keyness Analysis / Kulcsszó-elemzés

8 `textstat_collocations()` — Collocations / Szavak egybeesései

9 `textstat_entropy()` — Shannon entropy / Shannon-féle entrópia

10 `textstat_summary()` — Corpus-Tokens-DFM Summary / Corpus-Tokens-DFM összefoglaló

11 A Zipf-szabály

11.1 A Zipf-szabály táblázatban

11.2 A Zipf-szabály mint hatványfüggvény

11.3 A Zipf-szabály log-log skálán

11.4 A Zipf-szabály Jane Austennél

11.4.1 Táblázatban

11.4.2 Grafikonon

11.5 A Zipf-szabály és az evangéliumok

11.5.1 Táblázatban

11.5.2 Grafikonon

11.6 The largest cities by population in the world / A világ legnépesebb városai

11.6.1 In datatable

11.6.2 On graph

12 Quick-Reference Cheat Sheet / Gyorsreferencia táblázat

What can an R package do? / Mit tud egy R csomag? A showcase / Bemutató dokumentum

WR

2026-05-07

1 Setup & Data Preparation / Beállítás és adatok előkészítése

2 textstat_frequency() — Term Frequency / Kifejezés gyakorisága

2.1 Textstat-freqency - The President … / Az elnök

2.2 … and his favourite words / … és kedvenc szavai

3 textstat_lexdiv() — Lexical Diversity / Szókincs-változatosság

4 textstat_readability() — Readability Scores / Olvashatósági mutatók

5 textstat_dist() — Document Distance / Dokumentum távolság

6 textstat_simil() — Document Similarity / Dokumentum hasonlóság

7 textstat_keyness() — Keyness Analysis / Kulcsszó-elemzés

8 textstat_collocations() — Collocations / Szavak egybeesései

9 textstat_entropy() — Shannon entropy / Shannon-féle entrópia

10 textstat_summary() — Corpus-Tokens-DFM Summary / Corpus-Tokens-DFM összefoglaló

11 A Zipf-szabály

11.1 A Zipf-szabály táblázatban

11.2 A Zipf-szabály mint hatványfüggvény

11.3 A Zipf-szabály log-log skálán

11.4 A Zipf-szabály Jane Austennél

11.4.1 Táblázatban

11.4.2 Grafikonon

11.5 A Zipf-szabály és az evangéliumok

11.5.1 Táblázatban

11.5.2 Grafikonon

11.6 The largest cities by population in the world / A világ legnépesebb városai

11.6.1 In datatable

11.6.2 On graph

12 Quick-Reference Cheat Sheet / Gyorsreferencia táblázat

2 `textstat_frequency()` — Term Frequency / Kifejezés gyakorisága

3 `textstat_lexdiv()` — Lexical Diversity / Szókincs-változatosság

4 `textstat_readability()` — Readability Scores / Olvashatósági mutatók

5 `textstat_dist()` — Document Distance / Dokumentum távolság

6 `textstat_simil()` — Document Similarity / Dokumentum hasonlóság

7 `textstat_keyness()` — Keyness Analysis / Kulcsszó-elemzés

8 `textstat_collocations()` — Collocations / Szavak egybeesései

9 `textstat_entropy()` — Shannon entropy / Shannon-féle entrópia

10 `textstat_summary()` — Corpus-Tokens-DFM Summary / Corpus-Tokens-DFM összefoglaló