What can an R package do? / Mit tud egy R csomag? A showcase / Bemutató dokumentum
quanteda.textstatsis a companion package toquantedathat provides statistical analysis functions for text objects (corpora, tokens, and document- feature matrices). This document walks through each major function with reproducible examples using the built-in data_corpus_inaugural dataset. At the end of the document, I also introduce a few other R packages related to lexicology.
Aquanteda.textstatsaquantedakiegészítő csomagja, amely statisztikai elemzési függvényeket biztosít szöveges objektumokhoz (korpuszokhoz, tokenekhez és dokumentum-jellemző mátrixokhoz). Ez a dokumentum a beépített data_corpus_inaugural adatbázist felhasználva, reprodukálható példákkal mutatja be az egyes főbb függvényeket. A dokumentum végén még néhány egyéb, lexikológia tárgykörhöz készült R csomagot is bemutatok.
# Install packages if needed:
# install.packages(c("quanteda", "quanteda.textstats",
# "quanteda.textplots", "ggplot2", "dplyr"))
library(quanteda)
library(quanteda.textstats)
library(quanteda.textplots)
library(ggplot2)
library(dplyr)
library(paletteer)
library(plotly)
library(kableExtra)
library(textstem)
tokens <- quanteda::tokens # pin tokens() to quanteda for this session
# ── Build a DFM from US inaugural speeches ─────────
corp <- data_corpus_inaugural
toks <- corp |>
quanteda::tokens(remove_punct = TRUE,
remove_symbols = TRUE,
remove_numbers = TRUE) |>
quanteda::tokens_remove(pattern = stopwords("en"))
dfm_all <- dfm(toks)
# Subset: last 15 speeches for compact comparisons
corp15 <- corpus_subset(corp, Year >= 1965)
toks15 <- corp15 |>
quanteda::tokens(remove_punct = TRUE,
remove_symbols = TRUE,
remove_numbers = TRUE) |>
quanteda::tokens_remove(pattern = stopwords("en"))
dfm15 <- dfm(toks15)
cat("Full DFM:", ndoc(dfm_all), "docs ×", nfeat(dfm_all), "features\n")## Full DFM: 60 docs × 9359 features
## Subset DFM: 16 docs × 3993 features
textstat_frequency() — Term Frequency / Kifejezés
gyakoriságatextstat_frequency() returns term frequencies across the
whole corpus or within groups. Ez a függvény szógyakoriságokat ad vissza
az egész korpuszban, vagy annak csoportjaiban. The docfreq
column shows the number of documents in which the word appears at least
once. A docfreq oszlop azon dokumentumok számát mutatja, amelyekben a
szó legalább egyszer előfordul.
# Top 20 terms overall
freq <- textstat_frequency(dfm_all, n = 20)
knitr::kable(freq[, c("feature","frequency","rank","docfreq")],
caption = "Top 20 Stems — All Inaugural Speeches",
align = "lrrrr")| feature | frequency | rank | docfreq |
|---|---|---|---|
| people | 592 | 1 | 58 |
| government | 575 | 2 | 53 |
| us | 507 | 3 | 57 |
| can | 489 | 4 | 57 |
| must | 377 | 5 | 53 |
| upon | 371 | 6 | 47 |
| great | 354 | 7 | 57 |
| may | 343 | 8 | 54 |
| states | 343 | 8 | 48 |
| world | 329 | 10 | 54 |
| nation | 324 | 11 | 55 |
| country | 323 | 12 | 55 |
| shall | 316 | 13 | 51 |
| every | 309 | 14 | 53 |
| one | 274 | 15 | 50 |
| peace | 259 | 16 | 48 |
| new | 256 | 17 | 51 |
| power | 246 | 18 | 49 |
| now | 238 | 19 | 54 |
| public | 227 | 20 | 43 |
freq20 <- textstat_frequency(dfm_all, n = 20)
ggplot(freq20, aes(x = reorder(feature, frequency), y = frequency)) +
geom_col(fill = "#2c7bb6", alpha = .85) +
coord_flip() +
labs(title = "Top 20 Terms — US Inaugural Addresses",
subtitle = "After stopword removal and stemming",
x = NULL, y = "Total Frequency") +
theme_minimal(base_size = 13)# Frequency within groups (20th vs 21st century)
docvars(dfm_all, "century") <- ifelse(docvars(dfm_all, "Year") >= 2000,
"21st", "20th")
freq_grp <- textstat_frequency(dfm_all, n = 10, groups = century)
ggplot(freq_grp, aes(x = reorder(feature, frequency), y = frequency,
fill = group)) +
geom_col(show.legend = FALSE) +
facet_wrap(~group, scales = "free_y") +
coord_flip() +
scale_fill_manual(values = c("20th" = "#d7191c", "21st" = "#1a9641")) +
labs(title = "Top Terms by Century", x = NULL, y = "Frequency") +
theme_minimal(base_size = 12)textstat_lexdiv() — Lexical Diversity /
Szókincs-változatosságMeasures how rich and varied the vocabulary is for each document. Several indices are available: TTR, MATTR, MTLD, MSTTR, etc.
TTR (Type-Token Ratio - Szófaj-Szó Arány)
Azt mutatja meg, hogy az adott szövegben mekkora az egyedi szavak aránya az összes szóhoz képest. Hátránya: nagyon érzékeny a szöveghosszra, így nem használható közvetlenül különböző hosszúságú szövegek összehasonlítására.
MATTR (Moving-Average Type-Token Ratio - Mozgóátlagos TTR)
Ez egy modernebb, a szöveghosszra kevésbé érzékeny mérőszám. Előnye: alkalmas a különböző hosszúságú szövegek szókincsének összehasonlítására anélkül, hogy a szöveg hossza törzítaná az eredményt.
ld <- textstat_lexdiv(toks15,
measure = c("TTR", "MATTR", "MTLD"))
# Attach metadata and drop the 'document' column for display
ld$President <- docvars(corp15, "President")
ld$Year <- docvars(corp15, "Year")
# Select only columns that actually exist
show_cols <- intersect(c("Year", "President", "TTR", "MATTR", "MTLD"), names(ld))
knitr::kable(ld[, show_cols],
digits = 3,
caption = "Lexical Diversity — Speeches Since 1965")| Year | President | TTR | MATTR |
|---|---|---|---|
| 1965 | Johnson | 0.637 | 0.853 |
| 1969 | Nixon | 0.604 | 0.834 |
| 1973 | Nixon | 0.495 | 0.743 |
| 1977 | Carter | 0.694 | 0.867 |
| 1981 | Reagan | 0.646 | 0.867 |
| 1985 | Reagan | 0.587 | 0.857 |
| 1989 | Bush | 0.589 | 0.846 |
| 1993 | Clinton | 0.648 | 0.856 |
| 1997 | Clinton | 0.560 | 0.836 |
| 2001 | Bush | 0.642 | 0.886 |
| 2005 | Bush | 0.618 | 0.862 |
| 2009 | Obama | 0.683 | 0.912 |
| 2013 | Obama | 0.661 | 0.902 |
| 2017 | Trump | 0.641 | 0.834 |
| 2021 | Biden | 0.557 | 0.843 |
| 2025 | Trump | 0.581 | 0.874 |
# Pivot only the measure columns that exist
measure_cols <- intersect(c("TTR", "MATTR", "MTLD"), names(ld))
ld_long <- tidyr::pivot_longer(ld,
cols = all_of(measure_cols),
names_to = "Measure",
values_to = "Score")
ggplot(ld_long, aes(x = factor(Year), y = Score,
group = Measure, colour = Measure)) +
geom_line(linewidth = 1) +
geom_point(size = 2.5) +
facet_wrap(~Measure, scales = "free_y", ncol = 1) +
labs(title = "Lexical Diversity Over Time",
x = "Year", y = "Score") +
theme_minimal(base_size = 12) +
theme(legend.position = "none",
axis.text.x = element_text(angle = 45, hjust = 1))textstat_readability() — Readability Scores / Olvashatósági
mutatókQuantifies how easy or difficult each speech is to read using classic formulas such as Flesch, Flesch-Kincaid, Gunning Fog, etc.
# Works on a corpus (needs sentence structure)
corp15_nostop <- corpus_subset(data_corpus_inaugural, Year >= 1965)
rd <- textstat_readability(corp15_nostop,
measure = c("Flesch",
"Flesch.Kincaid",
"FOG",
"SMOG"))
rd$President <- docvars(corp15_nostop, "President")
rd$Year <- docvars(corp15_nostop, "Year")
knitr::kable(rd[, c("Year","President","Flesch","Flesch.Kincaid","FOG","SMOG")],
digits = 2,
caption = "Readability Scores — Speeches Since 1965")| Year | President | Flesch | Flesch.Kincaid | FOG | SMOG |
|---|---|---|---|---|---|
| 1965 | Johnson | 69.41 | 7.56 | 10.41 | 10.36 |
| 1969 | Nixon | 65.58 | 9.24 | 12.05 | 11.13 |
| 1973 | Nixon | 54.19 | 12.30 | 15.20 | 13.10 |
| 1977 | Carter | 53.38 | 11.67 | 14.55 | 13.06 |
| 1981 | Reagan | 58.75 | 9.76 | 12.92 | 12.22 |
| 1985 | Reagan | 57.58 | 10.42 | 13.48 | 12.47 |
| 1989 | Bush | 73.10 | 7.15 | 9.98 | 9.88 |
| 1993 | Clinton | 55.81 | 10.38 | 13.20 | 12.37 |
| 1997 | Clinton | 59.22 | 9.83 | 12.69 | 11.96 |
| 2001 | Bush | 60.12 | 8.93 | 11.63 | 11.37 |
| 2005 | Bush | 53.19 | 11.04 | 14.11 | 13.02 |
| 2009 | Obama | 60.53 | 10.23 | 12.71 | 11.55 |
| 2013 | Obama | 53.56 | 11.73 | 14.51 | 12.95 |
| 2017 | Trump | 58.58 | 9.17 | 12.16 | 11.78 |
| 2021 | Biden | 73.20 | 5.78 | 8.74 | 9.37 |
| 2025 | Trump | 55.08 | 9.67 | 12.64 | 12.15 |
ggplot(rd, aes(x = Year, y = Flesch, label = President)) +
geom_smooth(method = "loess", se = TRUE,
colour = "#f46d43", fill = "#fee090", linewidth = 1) +
geom_point(colour = "#4393c3", size = 3) +
ggrepel::geom_text_repel(size = 3, max.overlaps = 6) +
labs(title = "Flesch Reading Ease Over Time",
subtitle = "Higher = easier to read",
x = "Year", y = "Flesch Score") +
theme_minimal(base_size = 13)textstat_dist() — Document Distance / Dokumentum
távolságComputes pairwise distances between documents (or features) in the DFM. Supports Euclidean, Manhattan, cosine, etc.
dist_mat <- textstat_dist(dfm15, method = "euclidean")
# Build unique labels: "Year President" to avoid duplicate name clashes
labels15 <- paste(docvars(corp15, "Year"), docvars(corp15, "President"))
dist_df <- as.matrix(dist_mat)
rownames(dist_df) <- labels15
colnames(dist_df) <- labels15
# Heatmap via ggplot
dist_long <- dist_df |>
as.data.frame() |>
tibble::rownames_to_column("Doc1") |>
tidyr::pivot_longer(-Doc1, names_to = "Doc2", values_to = "Distance")
ggplot(dist_long, aes(x = Doc1, y = Doc2, fill = Distance)) +
geom_tile() +
scale_fill_distiller(palette = "RdYlBu", direction = -1) +
labs(title = "Euclidean Distance Between Speeches",
x = NULL, y = NULL) +
theme_minimal(base_size = 10) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))textstat_simil() — Document Similarity / Dokumentum
hasonlóságThe counterpart to textstat_dist(): higher values mean
more similar documents. Cosine similarity is the most
popular choice for text.
sim_mat <- textstat_simil(dfm15, method = "cosine")
# Same unique labels as above
sim_df <- as.matrix(sim_mat)
rownames(sim_df) <- labels15
colnames(sim_df) <- labels15
sim_long <- sim_df |>
as.data.frame() |>
tibble::rownames_to_column("Doc1") |>
tidyr::pivot_longer(-Doc1, names_to = "Doc2", values_to = "Similarity")
ggplot(sim_long, aes(x = Doc1, y = Doc2, fill = Similarity)) +
geom_tile() +
scale_fill_distiller(palette = "YlGn", direction = 1) +
labs(title = "Cosine Similarity Between Speeches",
x = NULL, y = NULL) +
theme_minimal(base_size = 10) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))textstat_keyness() — Keyness Analysis /
Kulcsszó-elemzésIdentifies terms that are significantly more (or less) frequent in a target set compared to a reference set. Uses chi-squared or log-likelihood tests.
# Target: 21st-century speeches; Reference: 20th-century speeches
dfm_cent <- dfm_group(dfm_all, groups = century)
key <- textstat_keyness(dfm_cent, target = "21st")
head(key, 15) |>
knitr::kable(digits = 3,
caption = "Keyness: 21st-Century vs 20th-Century Speeches")| feature | chi2 | p | n_target | n_reference |
|---|---|---|---|---|
| america | 192.347 | 0 | 88 | 132 |
| thank | 187.142 | 0 | 34 | 11 |
| americans | 104.360 | 0 | 40 | 50 |
| story | 100.626 | 0 | 18 | 5 |
| jobs | 73.151 | 0 | 12 | 2 |
| today | 53.647 | 0 | 37 | 78 |
| america’s | 50.472 | 0 | 18 | 19 |
| borders | 44.372 | 0 | 9 | 3 |
| nation | 42.932 | 0 | 72 | 252 |
| day | 42.776 | 0 | 30 | 64 |
| workers | 40.756 | 0 | 7 | 1 |
| raging | 40.288 | 0 | 6 | 0 |
| american | 40.158 | 0 | 47 | 138 |
| back | 39.316 | 0 | 18 | 25 |
| president | 38.315 | 0 | 31 | 73 |
# quanteda.textplots provides a dedicated keyness plot
if (requireNamespace("quanteda.textplots", quietly = TRUE)) {
quanteda.textplots::textplot_keyness(key,
n = 15,
labelcolor = "grey30") +
labs(title = "Keyness Plot — 21st vs 20th Century") +
theme_minimal(base_size = 12)
} else {
# Fallback ggplot version
key_top <- rbind(head(key, 10), tail(key, 10))
key_top$direction <- ifelse(key_top$chi2 > 0, "21st century", "20th century")
ggplot(key_top, aes(x = reorder(feature, chi2), y = chi2, fill = direction)) +
geom_col() +
coord_flip() +
scale_fill_manual(values = c("21st century" = "#1a9641",
"20th century" = "#d7191c")) +
labs(title = "Keyness: Top Distinctive Terms",
x = NULL, y = "Chi-squared statistic", fill = NULL) +
theme_minimal(base_size = 12)
}textstat_collocations() — Collocations / Szavak
egybeeséseiFinds multi-word expressions that appear together more often than chance. Useful for discovering idioms, named entities, and technical phrases.
# Run on the tokens object (before stopword removal for natural phrases)
toks_raw <- tokens(corp, remove_punct = TRUE)
col <- textstat_collocations(toks_raw,
size = 2, # bigrams
min_count = 5) # at least 5 occurrences
head(col, 20) |>
knitr::kable(digits = 3,
caption = "Top Bigram Collocations (λ statistic)")| collocation | count | count_nested | length | lambda | z |
|---|---|---|---|---|---|
| of the | 1786 | 0 | 2 | 1.563 | 53.069 |
| it is | 327 | 0 | 2 | 3.541 | 51.057 |
| has been | 188 | 0 | 2 | 5.200 | 50.397 |
| have been | 209 | 0 | 2 | 4.758 | 49.254 |
| those who | 130 | 0 | 2 | 5.817 | 45.564 |
| we have | 270 | 0 | 2 | 3.371 | 45.065 |
| united states | 165 | 0 | 2 | 7.935 | 43.578 |
| of our | 635 | 0 | 2 | 2.029 | 41.981 |
| will be | 220 | 0 | 2 | 3.337 | 41.329 |
| in the | 828 | 0 | 2 | 1.709 | 40.143 |
| let us | 101 | 0 | 2 | 6.342 | 37.517 |
| should be | 140 | 0 | 2 | 4.301 | 37.515 |
| we are | 187 | 0 | 2 | 3.127 | 36.252 |
| we will | 202 | 0 | 2 | 2.971 | 36.225 |
| may be | 126 | 0 | 2 | 4.043 | 35.165 |
| fellow citizens | 79 | 0 | 2 | 7.822 | 34.764 |
| i shall | 96 | 0 | 2 | 4.323 | 34.030 |
| we must | 128 | 0 | 2 | 3.688 | 33.173 |
| must be | 117 | 0 | 2 | 3.782 | 33.094 |
| there is | 104 | 0 | 2 | 4.150 | 32.331 |
col3 <- textstat_collocations(toks_raw, size = 3, min_count = 3)
head(col3, 10) |>
knitr::kable(digits = 3,
caption = "Top Trigram Collocations")| collocation | count | count_nested | length | lambda | z |
|---|---|---|---|---|---|
| of which the | 11 | 0 | 3 | 3.065 | 7.970 |
| all of us | 15 | 0 | 3 | 4.455 | 7.394 |
| in which the | 14 | 0 | 3 | 2.486 | 7.161 |
| than that of | 8 | 0 | 3 | 4.966 | 6.965 |
| is not the | 15 | 0 | 3 | 2.523 | 6.733 |
| is that of | 5 | 0 | 3 | 3.498 | 6.448 |
| as that of | 4 | 0 | 3 | 4.885 | 6.393 |
| the american people | 40 | 0 | 3 | 5.651 | 6.307 |
| of president of | 6 | 0 | 3 | 4.783 | 6.269 |
| to that of | 4 | 0 | 3 | 3.389 | 6.222 |
# 1. Előkészít#és
toks <- tokens(data_corpus_inaugural, remove_punct = TRUE) %>%
tokens_tolower() %>%
tokens_remove(stopwords("english"))
# 2. Co-occurrence mátrix (FCM) létrehozása 5 szavas ablakkal
fcmat <- fcm(toks, context = "window", window = 5)
# 3. Csak a 50 leggyakoribb szó kiválasztása a jobb olvashatóságért
# DFM létrehozása és a leggyakoribb szavak kigyűjtése
dfmat <- dfm(toks)
feat <- names(topfeatures(dfmat, 50)) # Itt még működik a topfeatures
# Az FCM szűrése a DFM alapján kapott szavakra
fcm_subset <- fcm_select(fcmat, pattern = feat)
# 4. Hálózati diagram kirajzolása
library(ggplot2)
textplot_network(fcm_subset,
min_freq = 0.9,
vertex_labelsize = 5,
vertex_color = "#E41A1C", # Pirosas csomópontok
edge_color = "#377EB8", # Kékes élek
edge_alpha = 0.4, # Halvány élek a jobb olvashatóságért
vertex_size = colSums(fcm_subset)/max(colSums(fcm_subset)) * 5) +
labs(title = "Szókapcsolatok hálózata az elnöki beszédekben",
hjust = 0.5,
subtitle = "Az 50 leggyakoribb szó alapján ('stopwords' nélkül)",
caption = "Forrás: data_corpus_inaugural")textstat_entropy() — Shannon entropy / Shannon-féle
entrópiaShannon entropy measures the diversity of term usage across documents: high entropy → terms spread evenly; low entropy → concentrated in few docs. A szavak entrópiájának vizsgálata: egy szó, amely csak egy dokumentumban fordul elő, alacsony entrópiával rendelkezik, míg egy olyan szó, amely sok dokumentumban egyenletesen eloszlik, magas entrópiával bír és csak néhány dokumentumban koncentrálódik.
A Shannon-entrópia egy vélhetően bekövetkező esemény
bizonytalanságát méri. A képlete a következő:
\[H(X) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i)\]
Ahol: - H(X) az entrópia, - P(x_i) az x_i esemény valószínűsége. A függvény akkor éri el a maximumát, amikor a rendszer a lehető legkevésbé jósolható előre, azaz egyenletes az eloszlás. Ez azt jelenti, hogy minden lehetséges kimenetel pontosan ugyanakkora valószínűséggel következik be. Képletben: p1 = p2 = p3 = … = pn = 1/n A maximum értékéhez a feltételt a képletbe helyettesítve: \[H_{max} = \log_{2}(n) \text{ bit}\] A Shannon-entrópia értéke mindig nem negatív, és a rendszer bizonytalanságának mértékét jelzi. Minél magasabb az entrópia, annál nagyobb a bizonytalanság, és annál több információt tartalmaz a rendszer. A legegyszerűbb példa az érmefeldobás: egy tisztességes érme esetén két kimenetel van (fej vagy írás), mindkettő valószínűsége 0,5. Ebben az esetben a Shannon-entrópia értéke: \[H(X) = - (0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1 \text{ bit}\] Ez azt jelenti, hogy egy tisztességes érme feldobása 1 bit információt tartalmaz, és a kimenetel teljesen bizonytalan. Ha az érme csaló lenne, és mindig fej lenne, akkor a Shannon-entrópia értéke 0 lenne, mivel nincs bizonytalanság: \[H(X) = - (1 \log_2 1 + 0 \log_2 0) = 0 \text{ bit}\]
ent <- textstat_entropy(dfm_all, margin = "features")
ent_top <- ent |>
arrange(desc(entropy)) |>
head(20)
ggplot(ent_top, aes(x = reorder(feature, entropy), y = entropy)) +
geom_col(fill = "#7b3294", alpha = .8) +
coord_flip() +
labs(title = "Top 20 Features by Entropy",
subtitle = "High entropy = term used across many documents evenly",
x = NULL, y = "Shannon Entropy") +
theme_minimal(base_size = 12)# Entropy across features within each document
ent_doc <- textstat_entropy(dfm15, margin = "documents")
ent_doc$President <- docvars(corp15, "President")
ggplot(ent_doc, aes(x = reorder(President, entropy), y = entropy)) +
geom_point(colour = "#e66101", size = 4) +
geom_segment(aes(xend = reorder(President, entropy), yend = 0),
colour = "#e66101", linewidth = .8) +
coord_flip() +
labs(title = "Document-Level Entropy",
x = NULL, y = "Entropy") +
theme_minimal(base_size = 12)textstat_summary() — Corpus-Tokens-DFM Summary /
Corpus-Tokens-DFM összefoglalóA quick diagnostic function returning token counts, type counts, sentences, and other metadata for each document.
summ_corp <- textstat_summary(corp15)
knitr::kable(summ_corp,
caption = "Corpus Summary — Speeches Since 1965",
digits = 1)| document | chars | sents | tokens | types | puncts | numbers | symbols | urls | tags | emojis |
|---|---|---|---|---|---|---|---|---|---|---|
| 1965-Johnson | 8205 | 93 | 1710 | 535 | 221 | 3 | 0 | 0 | 0 | 0 |
| 1969-Nixon | 11644 | 103 | 2416 | 714 | 292 | 0 | 0 | 0 | 0 | 0 |
| 1973-Nixon | 10007 | 68 | 1995 | 515 | 193 | 1 | 0 | 0 | 0 | 0 |
| 1977-Carter | 6878 | 52 | 1370 | 501 | 146 | 3 | 0 | 0 | 0 | 0 |
| 1981-Reagan | 13743 | 129 | 2781 | 850 | 349 | 1 | 0 | 0 | 0 | 0 |
| 1985-Reagan | 14572 | 123 | 2909 | 876 | 345 | 11 | 0 | 0 | 0 | 0 |
| 1989-Bush | 12529 | 141 | 2674 | 756 | 357 | 2 | 0 | 0 | 0 | 0 |
| 1993-Clinton | 9113 | 81 | 1833 | 605 | 235 | 0 | 0 | 0 | 0 | 0 |
| 1997-Clinton | 12262 | 111 | 2436 | 726 | 279 | 0 | 0 | 0 | 0 | 0 |
| 2001-Bush | 9054 | 97 | 1806 | 592 | 222 | 1 | 0 | 0 | 0 | 0 |
| 2005-Bush | 11923 | 99 | 2312 | 734 | 241 | 0 | 0 | 0 | 0 | 0 |
| 2009-Obama | 13460 | 110 | 2689 | 900 | 299 | 0 | 0 | 0 | 0 | 0 |
| 2013-Obama | 11917 | 88 | 2317 | 786 | 220 | 5 | 0 | 0 | 0 | 0 |
| 2017-Trump | 8433 | 88 | 1660 | 547 | 215 | 2 | 0 | 0 | 0 | 0 |
| 2021-Biden | 13133 | 216 | 2766 | 744 | 394 | 6 | 0 | 0 | 0 | 0 |
| 2025-Trump | 17077 | 177 | 3347 | 950 | 434 | 4 | 0 | 0 | 0 | 0 |
summ_tok <- textstat_summary(toks15)
knitr::kable(summ_tok,
caption = "Tokens Summary (after preprocessing)",
digits = 1)| document | chars | sents | tokens | types | puncts | numbers | symbols | urls | tags | emojis |
|---|---|---|---|---|---|---|---|---|---|---|
| 1965-Johnson | NA | NA | 691 | 440 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1969-Nixon | NA | NA | 1028 | 621 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1973-Nixon | NA | NA | 851 | 421 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1977-Carter | NA | NA | 592 | 411 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1981-Reagan | NA | NA | 1146 | 740 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1985-Reagan | NA | NA | 1291 | 758 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1989-Bush | NA | NA | 1092 | 643 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1993-Clinton | NA | NA | 798 | 517 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1997-Clinton | NA | NA | 1130 | 633 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2001-Bush | NA | NA | 783 | 503 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2005-Bush | NA | NA | 1041 | 643 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2009-Obama | NA | NA | 1173 | 801 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2013-Obama | NA | NA | 1031 | 681 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2017-Trump | NA | NA | 713 | 457 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2021-Biden | NA | NA | 1127 | 628 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2025-Trump | NA | NA | 1448 | 842 | 0 | 0 | 0 | 0 | 0 | 0 |
A Zipf-szabály egy statisztikai törvény, amely legérthetőbb megfogalmazása szerint egy adott rendszerben (például egy nyelvben) az elemek gyakorisága és a gyakorisági rangsorban elfoglalt helyük között fordított arányosság áll fenn. A szabály azt mutatja meg, hogy a világban sok látszólag kaotikus rendszer (mint az internetes forgalom, vagy a jövedelmek eloszlása) valójában egy szigorú matematikai mintát követ.
#Install the zipfR package
#install.packages("zipfR")
#Load the package
library(zipfR)
#Load necessary libraries
library(ggplot2)
#Define parameters
N <- 100 # Total number of elements
s <- 1.5 # Shape parameter
#Generate Zipf distribution probabilities
zipf_probs <- (1 / (1:N)^s) / sum(1 / (1:N)^s)
zipf_data <- data.frame(Rank = 1:N, Probability = zipf_probs)
#Display the first few rows
print(zipf_data)## Rank Probability
## 1 1 0.4144435056
## 2 2 0.1465279066
## 3 3 0.0797596898
## 4 4 0.0518054382
## 5 5 0.0370689541
## 6 6 0.0281993088
## 7 7 0.0223778459
## 8 8 0.0183159883
## 9 9 0.0153497595
## 10 10 0.0131058544
## 11 11 0.0113599471
## 12 12 0.0099699612
## 13 13 0.0088419959
## 14 14 0.0079117633
## 15 15 0.0071339235
## 16 16 0.0064756798
## 17 17 0.0059127832
## 18 18 0.0054269595
## 19 19 0.0050042032
## 20 20 0.0046336193
## 21 21 0.0043066184
## 22 22 0.0040163478
## 23 23 0.0037572802
## 24 24 0.0035249136
## 25 25 0.0033155480
## 26 26 0.0031261176
## 27 27 0.0029540626
## 28 28 0.0027972307
## 29 29 0.0026538009
## 30 30 0.0025222229
## 31 31 0.0024011694
## 32 32 0.0022894985
## 33 33 0.0021862228
## 34 34 0.0020904846
## 35 35 0.0020015354
## 36 36 0.0019187199
## 37 37 0.0018414620
## 38 38 0.0017692530
## 39 39 0.0017016429
## 40 40 0.0016382318
## 41 41 0.0015786634
## 42 42 0.0015226196
## 43 43 0.0014698149
## 44 44 0.0014199934
## 45 45 0.0013729242
## 46 46 0.0013283992
## 47 47 0.0012862298
## 48 48 0.0012462452
## 49 49 0.0012082901
## 50 50 0.0011722233
## 51 51 0.0011379157
## 52 52 0.0011052495
## 53 53 0.0010741169
## 54 54 0.0010444188
## 55 55 0.0010160646
## 56 56 0.0009889704
## 57 57 0.0009630594
## 58 58 0.0009382603
## 59 59 0.0009145076
## 60 60 0.0008917404
## 61 61 0.0008699025
## 62 62 0.0008489416
## 63 63 0.0008288091
## 64 64 0.0008094600
## 65 65 0.0007908522
## 66 66 0.0007729465
## 67 67 0.0007557065
## 68 68 0.0007390979
## 69 69 0.0007230889
## 70 70 0.0007076496
## 71 71 0.0006927520
## 72 72 0.0006783699
## 73 73 0.0006644787
## 74 74 0.0006510551
## 75 75 0.0006380775
## 76 76 0.0006255254
## 77 77 0.0006133795
## 78 78 0.0006016216
## 79 79 0.0005902347
## 80 80 0.0005792024
## 81 81 0.0005685096
## 82 82 0.0005581418
## 83 83 0.0005480854
## 84 84 0.0005383273
## 85 85 0.0005288554
## 86 86 0.0005196581
## 87 87 0.0005107242
## 88 88 0.0005020435
## 89 89 0.0004936059
## 90 90 0.0004854020
## 91 91 0.0004774229
## 92 92 0.0004696600
## 93 93 0.0004621053
## 94 94 0.0004547509
## 95 95 0.0004475895
## 96 96 0.0004406142
## 97 97 0.0004338182
## 98 98 0.0004271951
## 99 99 0.0004207388
## 100 100 0.0004144435
#Basic Zipf distribution plot
ggplot(zipf_data, aes(x = Rank, y = Probability)) +
geom_line(color = "brown", size = .75) +
labs(title = "Basic Zipf Distribution",
x = "Rank",
y = "Probability") +
theme_minimal()#Log10 Zipf distribution plot
ggplot(zipf_data, aes(x = Rank, y = Probability)) +
geom_line(color = "brown", size = .75) +
scale_x_log10() +
scale_y_log10() +
labs(title = "Log/log Scale Zipf Distribution",
x = "Rank",
y = "Probability") +
theme_minimal()#— Letöltés & corpus pipeline —
library(gutenbergr)
austen_muvek <- gutenberg_download(
c(1342, 161, 141, 105),
meta_fields = "title"
)
austen_corpus <- austen_muvek %>%
group_by(title) %>%
summarise(text = paste(text, collapse = " ")) %>%
corpus(text_field = "text", docid_field = "title")
austen_dfm <- austen_corpus %>%
tokens(remove_punct = TRUE) %>%
tokens_tolower() %>%
tokens_remove(stopwords("en")) %>%
dfm()#--- Top szavak kinyerése ---
top_words <- topfeatures(austen_dfm, n = 30, groups = docnames(austen_dfm))
muvek_lista <- lapply(names(top_words), function(mu) {
data.frame(
Szó = names(top_words[[mu]]),
Gyakoriság = unname(top_words[[mu]])
)
})
tabla_egyutt <- do.call(cbind, muvek_lista)
rownames(tabla_egyutt) <- 1:30
#Fejléc: rövidített műcímek (helytakarékosság)
cimek <- c("Pride & Prejudice", "Sense & Sensibility", "Mansfield Park", "Persuasion")
col_nevek <- rep(c("Szó", "N"), 4)
tabla_egyutt %>%
kable("html",
caption = "Top 30 szó – 4 Austen-mű",
col.names = col_nevek) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = TRUE) %>%
add_header_above(setNames(rep(2, 4), cimek)) %>%
row_spec(0, bold = TRUE, color = "white", background = "#2c7bb6")| Szó | N | Szó | N | Szó | N | Szó | N |
|---|---|---|---|---|---|---|---|
| fanny | 814 | anne | 446 | mr | 807 | elinor | 619 |
| must | 498 | captain | 302 | elizabeth | 605 | mrs | 529 |
| crawford | 492 | mrs | 291 | said | 406 | marianne | 489 |
| mr | 481 | mr | 254 | darcy | 383 | said | 397 |
| much | 456 | elliot | 254 | mrs | 353 | every | 375 |
| miss | 431 | one | 230 | much | 335 | one | 304 |
| mrs | 408 | must | 227 | miss | 315 | much | 290 |
| said | 406 | lady | 214 | must | 312 | must | 278 |
| sir | 372 | much | 205 | bennet | 309 | time | 237 |
| one | 365 | wentworth | 191 | one | 284 | know | 231 |
| edmund | 361 | little | 176 | jane | 273 | dashwood | 229 |
| think | 349 | said | 173 | bingley | 262 | now | 223 |
| now | 332 | good | 169 | know | 241 | well | 217 |
| might | 324 | might | 165 | though | 238 | edward | 217 |
| little | 309 | now | 157 | never | 228 | sister | 216 |
| time | 306 | never | 154 | think | 222 | might | 215 |
| well | 298 | charles | 153 | can | 218 | though | 215 |
| nothing | 298 | time | 151 | well | 217 | miss | 209 |
| thomas | 289 | sir | 149 | soon | 216 | think | 208 |
| good | 279 | think | 149 | now | 208 | mother | 203 |
| never | 278 | well | 145 | might | 206 | can | 199 |
| without | 269 | nothing | 138 | may | 203 | jennings | 199 |
| know | 251 | great | 128 | time | 200 | never | 186 |
| can | 244 | know | 126 | lady | 190 | nothing | 183 |
| bertram | 240 | miss | 125 | little | 190 | thing | 183 |
| first | 237 | see | 123 | nothing | 185 | soon | 179 |
| soon | 222 | man | 123 | every | 185 | mr | 178 |
| see | 219 | walter | 123 | without | 177 | willoughby | 178 |
| though | 216 | soon | 122 | sister | 177 | without | 173 |
| every | 214 | mary | 121 | good | 174 | may | 172 |
szinek <- c(
"Pride and Prejudice" = "#e74c3c",
"Sense and Sensibility" = "#2980b9",
"Mansfield Park" = "#27ae60",
"Persuasion" = "#8e44ad"
)
freq_muvek <- textstat_frequency(austen_dfm, groups = docnames(austen_dfm)) %>%
rename(mu = group)
plot_ly() %>%
{
p <- .
for (m in names(szinek)) {
df_m <- filter(freq_muvek, mu == m)
p <- add_trace(p,
data = df_m,
x = ~rank,
y = ~frequency,
type = "scatter", mode = "markers",
name = m,
marker = list(size = 3, opacity = 0.5,
color = szinek[[m]]),
hovertemplate = paste0("<b>", m, "</b><br>",
"szó: %{text}<br>",
"rang: %{x}<br>",
"előfordulás: %{y}<extra></extra>"),
text = ~feature
)
}
p
} %>%
plotly::layout(
title = "Zipf-eloszlás (eredeti skála) – 4 Austen-mű",
xaxis = list(title = "Rang"),
yaxis = list(title = "Gyakoriság (előfordulások száma)"),
legend = list(title = list(text = "<b>Mű</b>")),
template = "plotly_white"
)#Get the data
library(readxl)
#Make frequency tables
library(tidyverse)
evm <- read_xlsx("Mt_short.xlsx")
freqtab1 <- evm %>% count(FullWord, sort=TRUE)
top50Mt <- freqtab1[1:50,]
Mt_total <- sum(freqtab1$n)
#
evm <- read_xlsx("Mk_short.xlsx")
freqtab2 <- evm %>% count(FullWord, sort=TRUE)
top50Mk <- freqtab2[1:50,]
Mk_total <- sum(freqtab2$n)
#
evm <- read_xlsx("Lk_short.xlsx")
freqtab3 <- evm %>% count(FullWord, sort=TRUE)
top50Lk <- freqtab3[1:50,]
Lk_total <- sum(freqtab3$n)
#
evm <- read_xlsx("Jn_short.xlsx")
freqtab4 <- evm %>% count(FullWord, sort=TRUE)
top50Jn <- freqtab4[1:50,]
Jn_total <- sum(freqtab4$n)
#
evmtab50 <- cbind(top50Mt,top50Mk,top50Lk,top50Jn)
names(evmtab50) <- c("Szó(Mt)","n","Szó(Mk)","n","Szó(Lk)","n","Szó(Jn)","n")
evmtab50#install.packages("plotly")
library(plotly)
datus <- data.frame(Roll_number = 1:50,
y1 = top50Mt$n,
y2 = top50Mk$n,
y3 = top50Lk$n,
y4 = top50Jn$n)
#
fig <-plotly::plot_ly(data = datus, x = ~Roll_number,
y = ~y1, name = "Mt",
type = "scatter",mode = "lines") %>%
add_trace(y = ~y2, name = "Mk") %>%
add_trace(y = ~y4, name = "Jn") %>%
add_trace(y = ~y3, name = "Lk") %>%
layout(title = 'Zipfs law and the gospels', xaxis = list(title = 'Helyezés'),
yaxis = list(title = 'Előfordulás'), legend = list(title=list(text='Legend Title')))
fig
``` r
library(readr)
bigs <- read_csv("largest-cities-by-population-2026.csv")
library(reactable)
reactable(bigs,
searchable = TRUE, # keresőmező
filterable = TRUE, # oszloponkénti szűrő
striped = TRUE, # csíkos sorok
highlight = TRUE, # hover kiemelés
defaultPageSize = 25, # sorok oldalanként
height = 500 # görgethető magasság
)| Function | Input | What it computes |
|---|---|---|
textstat_frequency() |
DFM | Term frequencies & document frequencies |
textstat_lexdiv() |
Tokens | Lexical diversity (TTR, MATTR, MTLD, …) |
textstat_readability() |
Corpus | Readability indices (Flesch, FOG, SMOG, …) |
textstat_dist() |
DFM | Pairwise document/feature distances |
textstat_simil() |
DFM | Pairwise document/feature similarities |
textstat_keyness() |
DFM | Keyness of terms in target vs reference |
textstat_collocations() |
Tokens | Multi-word collocations (λ, z-scores) |
textstat_entropy() |
DFM | Shannon entropy per doc or feature |
textstat_summary() |
Corpus/Tokens/DFM | Token, type & sentence counts per doc |