@locusclassicus for https://antibarbari.com/ Version 06/09/2021
In this project, we’ll try to make a list of the most frequent words (mfw) used by Gorgias of Leontini (tlg0593) in his Encomium of Helen. The list can then be used to compile a vocabulary or be exported to Quizlet. The frequencies could also serve stylometric purposes, but that is beyond the scope of this project.
To make a word list, we need a lemmatized text, which can be found at github.com/gcelano/LemmatizedAncientGreekXML. This repository contains Ancient Greek texts that have been automatically tokenized, POS-tagged, sentence-split, and lemmatized.
In the directory “texts” we find the text we need and download it:
url <- "https://raw.githubusercontent.com/gcelano/LemmatizedAncientGreekXML/master/texts/tlg0593.1st1K001.1st1K-grc1.xml"
file_name <- "Gorgias_Helen.xml"
download.file(url, file_name)
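Note that download.file() fetches the file anew on every run; if you prefer to download only once, a minimal guard looks like this:
# skip the download if the file is already in the working directory
if (!file.exists(file_name)) {
  download.file(url, file_name)
}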
We then parse the file and extract the lemmata (or the wordforms, where lemmata are missing), which are stored under the l1, l2, or f tags:
library(XML)
doc <- xmlTreeParse(file_name, useInternalNodes = TRUE)
rootnode <- xmlRoot(doc)
all_tokens <- getNodeSet(rootnode, "//t") # one node per token
lemmata <- character()
# prefer the lemma in <l1>, fall back to <l2>, then to the wordform in <f>
for (i in seq_along(all_tokens)) {
  if (length(sapply(all_tokens[[i]][["l"]]["l1"], xmlValue)) > 0) {
    lemmata <- c(lemmata, sapply(all_tokens[[i]][["l"]]["l1"], xmlValue))
  } else if (length(sapply(all_tokens[[i]][["l"]]["l2"], xmlValue)) > 0) {
    lemmata <- c(lemmata, sapply(all_tokens[[i]][["l"]]["l2"], xmlValue))
  } else {
    lemmata <- c(lemmata, sapply(all_tokens[[i]]["f"], xmlValue))
  }
}
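As an aside, the same l1 > l2 > f fallback can be factored into a small helper and applied over all tokens; a functionally equivalent sketch:
# sketch: the fallback logic as a helper function applied with lapply()
get_lemma <- function(tok) {
  v <- sapply(tok[["l"]]["l1"], xmlValue)
  if (length(v) == 0) v <- sapply(tok[["l"]]["l2"], xmlValue)
  if (length(v) == 0) v <- sapply(tok["f"], xmlValue)
  v
}
lemmata <- unlist(lapply(all_tokens, get_lemma))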
length(lemmata)
## [1] 1544
class(lemmata)
## [1] "character"
head(lemmata, n = 21)
## l1 l1 l1 l1 l2 l1 l1
## "κόσμος" "πόλις" "μέν" "εὐανδρία" "," "σῶμα" "δέ"
## l1 l2 l1 l1 f l2 l1
## "κάλλος" "," "ψυχή" "δέ" "σοφία" "," "πρᾶγμα"
## l1 l1 l2 l1 l1 l1 l2
## "δέ" "ἀρετή" "," "λόγος" "δέ" "ἀλήθεια" "·"
This is quite a short text, and the opening looks familiar!
The word list we get upon retrieval of the lemmata (and wordforms, where lemmata are missing) contains several lemmatisation errors that need to be fixed. We also want to delete the punctuation marks, since they should not count as words.
Let’s first see how many unique entries we have before cleaning up:
length(unique(lemmata))
## [1] 456
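A convenient way to inspect the entries for errors is simply to print the unique values in alphabetical order:
# print all unique raw entries alphabetically to spot lemmatisation errors
sort(unique(lemmata))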
The following transformations were made after such a visual inspection of the data. Lots of participles need to be filed under their corresponding verbs!
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.2 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
lemmata_clean <- str_remove_all(lemmata, "[.·\\]\\[,!;]") # strip punctuation marks
lemmata_clean <- tolower(lemmata_clean)
# corrections established by visual inspection: pattern = replacement
corrections <- c("δ᾿" = "δέ",
                 "τίη" = "ἤ",
                 "γραφείς" = "γράφω",
                 "μετεωρολόγων" = "μετεωρολόγος",
                 "προσοῦσα" = "πρόσειμι",
                 "τῷ" = "ὁ",
                 "ἠπάτα" = "ἀπατάω",
                 "παροιχ.μ...." = "παροίχομαι",
                 "δυσπραγίαις" = "δυσπραγία",
                 "ὅς2" = "ὅς",
                 "σφαλεραῖς" = "σφαλερός",
                 "ὀρφανισθεῖσα" = "ὀρφανίζω",
                 "βιασθεῖσα" = "βιάζω",
                 "ὑβρισθεῖσα" = "ὑβρίζω",
                 "ἁρπασθεῖσα" = "ἁρπάζω",
                 "κακολογηθείη" = "κακολογέω",
                 "δέον" = "δέω",
                 "αἰτιώμενος" = "αἰτιάομαι",
                 "ἀπολυτέον" = "ἀπολύω",
                 "ἀπέπλησε" = "ἀποπίμπλημι",
                 "ἀλλ᾿" = "ἀλλά",
                 "μεμφομένους" = "μέμφομαι",
                 "κράτιστος" = "κρείσσων")
lemmata_clean <- str_replace_all(lemmata_clean, corrections)
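A quick spot-check that the substitutions took effect; the result should be an empty character vector:
# none of the corrected forms should survive the clean-up
lemmata_clean[lemmata_clean %in% c("γραφείς", "ὅς2", "τίη", "κράτιστος")]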
Let’s check again the number of unique entries:
length(unique(lemmata_clean))
## [1] 432
The number of unique entries has now decreased. However, it should be remembered that we might need to disambiguate certain entries when compiling the vocabulary. Minor corrections might be needed, too, at a later stage, if we use a critical edition different from that used for lemmatisation.
We now want to make a frequency list, so that we can exclude from our lexical minimum the words that occur only once in this text.
library(stylo)
##
## ### stylo version: 0.7.4 ###
##
## If you plan to cite this software (please do!), use the following reference:
## Eder, M., Rybicki, J. and Kestemont, M. (2016). Stylometry with R:
## a package for computational text analysis. R Journal 8(1): 107-121.
## <https://journal.r-project.org/archive/2016/RJ-2016-007/index.html>
##
## To get full BibTeX entry, type: citation("stylo")
freq <- make.frequency.list(lemmata_clean, relative = FALSE, value = TRUE)
Here’s what the head of this list looks like (note that we used absolute, not relative, frequencies). Function words are the most frequent, but πείθω and ψυχή are prominent too, which is just what one would expect from Gorgias! The blank entry (the second mfw) is left over from deleting the punctuation marks; we’ll remove it later.
head(freq, n = 20)
## data
## ὁ καί δέ ὅς λόγος μέν γάρ ὡς εἰμί ἔχω οὐ πείθω
## 205 174 91 62 48 34 32 19 16 14 13 13 13
## τε ψυχή διά εἰ σῶμα οὖν αἰτία
## 13 13 12 12 12 11 9
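For comparison, essentially the same counts can be obtained with base R’s table(), which also makes clear what kind of object we are dealing with; a minimal sketch:
# base-R equivalent: tabulate the lemmata and sort in decreasing order
freq_base <- sort(table(lemmata_clean), decreasing = TRUE)
head(freq_base, 20)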
Indeed, freq itself is a one-dimensional table: its names are the words, and its values are their counts.
dim(freq)
## [1] 432
head(names(freq))
## [1] "ὁ" "" "καί" "δέ" "ὅς" "λόγος"
For our lexical minimum, we want to subset the words that occur at least twice in this text. This can be done in many ways, for instance:
library(dplyr)
freq_t <- tibble::as_tibble(freq)
head(freq_t)
## # A tibble: 6 x 2
## data n
## <chr> <int>
## 1 "ὁ" 205
## 2 "" 174
## 3 "καί" 91
## 4 "δέ" 62
## 5 "ὅς" 48
## 6 "λόγος" 34
freq_t <- freq_t[-2, ] # remove the blank row left over from deleting punctuation marks
col_names <- c("words", "frq")
colnames(freq_t) <- col_names # assign names to the columns
mfw <- filter(freq_t, frq >= 2) # keep rows with a frequency of at least 2
dim(mfw) # how many words are there in the list?
## [1] 163 2
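For comparison, the same subset can also be taken in base R, working directly on the named table:
# drop the blank entry, then keep the words with a count of at least 2
freq_no_blank <- freq[names(freq) != ""]
mfw_base <- freq_no_blank[freq_no_blank >= 2]
length(mfw_base) # should equal nrow(mfw), i.e. 163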
We may now want to save the list of mfw to use later in our commentary on Gorgias.
write.table(mfw$words, file = "mfw.txt", fileEncoding = "utf-8", quote = FALSE, row.names = FALSE, col.names = FALSE) # saves the file to the working directory
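If the list is destined for Quizlet, a two-column file may be a more convenient starting point; a sketch, assuming a tab-separated term/definition import format (here the frequency stands in as a placeholder definition, and the mfw_quizlet.txt name is hypothetical):
# sketch: word + frequency as tab-separated pairs for import elsewhere
write.table(mfw, file = "mfw_quizlet.txt", sep = "\t", fileEncoding = "utf-8",
            quote = FALSE, row.names = FALSE, col.names = FALSE)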
One can similarly subset and save the words that occur only once in our text:
hapax <- filter(freq_t, frq < 2)
write.table(hapax$words, file = "hapax.txt", fileEncoding = "utf-8", quote = FALSE, row.names = FALSE, col.names = FALSE)
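As a sanity check, the two lists together should cover the whole vocabulary, i.e. the 432 unique entries minus the blank one:
nrow(mfw) + nrow(hapax) # should give 431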
Both lists are now in our folder and can be used for further research! Just for fun, let’s see which frequency values are the most common in our text. Let’s use relative frequencies this time.
freq_rel <- make.frequency.list(lemmata_clean, relative = TRUE, value = TRUE)
freq_rel_t <- tibble::as_tibble(freq_rel)
freq_rel_t <- freq_rel_t[-2, ] # again remove the blank row left by deleting punctuation
colnames(freq_rel_t) <- col_names
freq_rel_c <- freq_rel_t %>% count(frq, sort = TRUE) # count how many words share each frequency value
col_names2 <- c("relative_frequency", "number_of_words")
colnames(freq_rel_c) <- col_names2
library(ggplot2)
ggplot(freq_rel_c, aes(x = number_of_words, y = relative_frequency)) +
  geom_point(color = "darkblue", size = 3, alpha = 0.7) +
  lims(x = c(0, 40), y = c(0, 4.5))
## Warning: Removed 4 rows containing missing values (geom_point).
Well, nothing surprising here: there are very few words with a high frequency! Note that we had to exclude some observations from the plot in order to ‘zoom in’ on its most populated part.
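If one prefers to keep every observation instead of cropping the axes, log scales are an alternative; a sketch:
# alternative: log10 axes show the full Zipf-like distribution without cropping
ggplot(freq_rel_c, aes(x = number_of_words, y = relative_frequency)) +
  geom_point(color = "darkblue", size = 3, alpha = 0.7) +
  scale_x_log10() +
  scale_y_log10()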