Three Platos: Measuring Distances

Synopsis

In our previous series of experiments, we chose the following combinations of three “Platos” to be discriminated by a machine-learning classifier.

  • Set 1: Plato 1 (Ly., Chrm.) vs. Plato 2 (R. 2 and 3) vs. Plato 3 (Lg. 1 and 2)

  • Set 2: Plato 1 (Prt., Grg.) vs. Plato 2 (R. 8 and 9) vs. Plato 3 (Lg. 8 and 9)

Only two dialogues (or books) were assigned to each Plato. With 1000-word blocks and the 70 to 100 most frequent words (mfw), the classifier (Delta) was able to assign a sample from a text to its corresponding “author” correctly. Though in fact written by one person (presumably with some chronological gap), the texts in each group “look different” statistically to the classifier, and we intend to use our Platos for further comparison with a test dialogue which we suspect to have been revised. Before we do so, however, it is advisable to take a closer look at our reference Platos; here we are interested in the specific Delta distances between them. In sum, we conclude that Set 1 needs to be redefined, because one of its groups currently contains texts separated by a Delta greater than 1.

Corpus

For these experiments, I used the Diorisis Ancient Greek Corpus [@vatri2018]. On the accuracy of the lemmatization, see [@vatri2020]. The code I used for extracting the lemmata is available via the links: Parsing Plato’s Republic (Separate Books), Parsing Plato’s Laws (Separate Books), Corpus_Platonicum: Lemmata Extraction. I start with the files produced by this code in my working directory.

Packages

library(stylo) 
library(dendextend)

Load and Parse

dir.create("corpus1")
file.copy(c("Protagoras.txt", "Gorgias.txt", "Republic2.txt", "Republic3.txt", "Republic8.txt", "Republic9.txt", "Laws1.txt", "Laws2.txt","Laws8.txt","Laws9.txt", "Lysis.txt","Charmides.txt"), "corpus1")
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
my_corpus1 <- load.corpus.and.parse(files = "all", corpus.dir = "corpus1", markup.type= "plain", corpus.lang = "Other", sampling = "no.sampling", preserve.case = FALSE, encoding = "UTF-8")

Most Frequent Words

my_freq <- make.frequency.list(my_corpus1, value = TRUE)
dim(my_freq) 
## [1] 5498
head(my_freq)
## data
##        ὁ      καί     εἰμί       δέ    οὗτος      ἐγώ 
## 9.137414 5.673112 2.735435 2.676744 1.869359 1.780531

For our analysis, we want to choose the 100 mfw from the corpus. my_freq lists frequencies in descending order, so we simply need to take the first 100 elements.

mfw100 <- my_freq[1:100]
names(mfw100)
##   [1] "ὁ"        "καί"      "εἰμί"     "δέ"       "οὗτος"    "ἐγώ"     
##   [7] "οὐ"       "τε"       "αὐτός"    "ἄν"       "μέν"      "ἠέ"      
##  [13] "ὅς"       "τις"      "λέγω"     "φημί"     "γάρ"      "ἐν"      
##  [19] "σύ"       "ἀλλά"     "γε"       "ἄλλος"    "ὅστις"    "μή"      
##  [25] "δή"       "ὡς"       "τίς"      "οὖν"      "γίγνομαι" "εἰ"      
##  [31] "πᾶς"      "περί"     "ὦ"        "ἔχω"      "τοιοῦτος" "πρός"    
##  [37] "ἀγαθός"   "λόγος"    "ἑαυτοῦ"   "οὕτως"    "πολύς"    "δοκέω"   
##  [43] "ποιέω"    "ἐάν"      "κακός"    "πόλις"    "εἰς"      "οὐδείς"  
##  [49] "κατά"     "εἶπον"    "οἴομαι"   "ἐκ"       "καλός"    "ἄνθρωπος"
##  [55] "διά"      "ἐπί"      "οὐδέ"     "ὑπό"      "οὔτε"     "νῦν"     
##  [61] "δέομαι"   "μέγας"    "βούλομαι" "οὐκοῦν"   "ἀίω"      "οἷος"    
##  [67] "ἕ"        "ἀνήρ"     "ὀρθός"    "πάνυ"     "ἀληθής"   "φίλος"   
##  [73] "ἐκεῖνος"  "πῶς"      "πρότερος" "οἶδα"     "αὖ"       "νόμος"   
##  [79] "ὥσπερ"    "ἄρα"      "ψυχή"     "σωκράτης" "πράσσω"   "ἆρα"     
##  [85] "δίκαιος"  "ἦ"        "φαίνω"    "παρά"     "μετά"     "ἀδικέω"  
##  [91] "ἡδονά"    "ὅσος"     "ἕκαστος"  "ἔοικα"    "σῶμα"     "μήτε"    
##  [97] "ὅδε"      "ἔτι"      "μᾶλλον"   "μόνος"

One lemma in this list will be excluded from the analysis: σωκράτης (entry 82), a proper name that is absent from some later dialogues.

mfw99 <- mfw100[-82]
names <- names(mfw99)
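Dropping the lemma by position works only as long as the ranking stays the same. A minimal sketch of removing it by name instead is given here; `toy_mfw` and `drop_lemma` are illustrative names that do not occur in the original code, and the numbers are arbitrary placeholders rather than corpus values.

```r
# Toy frequency vector; the numbers are arbitrary placeholders, not corpus values.
toy_mfw <- c("ψυχή" = 3, "σωκράτης" = 2, "πράσσω" = 1)

# Keep every entry whose name is not in the exclusion list.
drop_lemma <- function(freqs, exclude) {
  freqs[!(names(freqs) %in% exclude)]
}

toy_kept <- drop_lemma(toy_mfw, "σωκράτης")
names(toy_kept)
```

Applied to the real list, the same idea would presumably read `mfw100[names(mfw100) != "σωκράτης"]`, which keeps working if the corpus (and hence the rank of the name) changes.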

Dialogues (12): Frequency Lists

In order to compute Delta, we first need to make a frequency list for each dialogue in our corpus (with the texts to be compared in rows and the variables in columns). The dialogues are stored as character vectors within my_corpus1.

Chrm_freq <- make.frequency.list(my_corpus1$Charmides, value = TRUE)
Grg_freq <- make.frequency.list(my_corpus1$Gorgias, value = TRUE)
Ly_freq <- make.frequency.list(my_corpus1$Lysis, value = TRUE)
Prt_freq <- make.frequency.list(my_corpus1$Protagoras, value = TRUE)
R2_freq <- make.frequency.list(my_corpus1$Republic2, value = TRUE)
R3_freq <- make.frequency.list(my_corpus1$Republic3, value = TRUE)
R8_freq <- make.frequency.list(my_corpus1$Republic8, value = TRUE)
R9_freq <- make.frequency.list(my_corpus1$Republic9, value = TRUE)
L1_freq <- make.frequency.list(my_corpus1$Laws1, value = TRUE)
L2_freq <- make.frequency.list(my_corpus1$Laws2, value = TRUE)
L8_freq <- make.frequency.list(my_corpus1$Laws8, value = TRUE)
L9_freq <- make.frequency.list(my_corpus1$Laws9, value = TRUE)
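The twelve calls above follow one pattern, so they could also be produced in a single pass over the corpus. The sketch below shows the idea in base R on a toy corpus; `toy_corpus` and `rel_freq` are hypothetical stand-ins, and `rel_freq` only approximates what a relative frequency list looks like (percentages in descending order), without claiming to reproduce stylo's function exactly.

```r
# Toy "corpus": a named list of token vectors
# (hypothetical stand-ins for lemmatized dialogues).
toy_corpus <- list(
  TextA = c("ὁ", "καί", "ὁ", "δέ"),
  TextB = c("καί", "καί", "ὁ", "εἰμί", "ὁ")
)

# Relative frequency (in percent) of each token, sorted in descending order.
rel_freq <- function(tokens) {
  counts <- table(tokens)
  sort(100 * counts / length(tokens), decreasing = TRUE)
}

# One frequency list per text, kept together in a named list.
toy_freqs <- lapply(toy_corpus, rel_freq)
toy_freqs$TextA
```

With stylo loaded, the analogous call on the real data should amount to `lapply(my_corpus1, make.frequency.list, value = TRUE)`, giving a list whose elements are accessed by dialogue name.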

Table of Frequencies

As we don’t need all the frequencies for the analysis, we subset using the list of features we have just created.

Chrm <- Chrm_freq[names]
Grg <- Grg_freq[names]
Ly <- Ly_freq[names]
Prt <- Prt_freq[names]
R2 <- R2_freq[names]
R3 <- R3_freq[names]
R8 <- R8_freq[names]
R9 <- R9_freq[names]
L1 <- L1_freq[names]
L2 <- L2_freq[names]
L8 <- L8_freq[names]
L9 <- L9_freq[names]
dataset1 <- rbind(Grg, Prt, R8, R9, L8, L9)
dataset2 <- rbind(Ly, Chrm, R2, R3, L1, L2)

dataset1[,1:5]
##             ὁ      καί     εἰμί       δέ    οὗτος
## Grg  9.105061 5.649846 3.174242 1.936439 2.251585
## Prt  8.491149 5.743186 3.276201 2.489463 2.197246
## R8   8.883466 6.516168 2.330689 3.014033 1.683954
## R9   9.346748 6.561680 2.857976 2.828813 1.531059
## L8  11.234434 5.522469 1.759610 3.126692 2.111532
## L9  11.570589 4.589048 1.773744 3.434052 1.691245
dataset2[,1:5]
##              ὁ      καί     εἰμί       δέ    οὗτος
## Ly    8.553009 4.828080 3.868195 3.008596 1.418338
## Chrm  8.338347 5.101672 4.151125 2.791481 1.672482
## R2    8.217489 5.672646 2.399103 3.094170 1.199552
## R3    7.609965 6.374076 2.423122 3.123784 1.440249
## L1    9.070878 5.471906 1.628106 2.521728 1.995348
## L2   10.405028 6.103352 1.815642 2.402235 2.150838
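Subsetting by name assumes that every one of the 99 lemmata occurs in every dialogue. In a named vector, a name that is absent yields NA, which would then propagate into the distance computation. The following sketch, with the hypothetical helper `align_features` and placeholder values, shows one way to fill such gaps with zero before binding the rows.

```r
# Toy per-text frequencies; "νόμος" is missing from text_b (placeholder values).
text_a <- c("ὁ" = 9.1, "καί" = 5.7, "νόμος" = 0.4)
text_b <- c("ὁ" = 8.5, "καί" = 5.1)
features <- c("ὁ", "καί", "νόμος")

# Subset by feature name and replace missing entries with zero.
align_features <- function(freqs, features) {
  out <- freqs[features]   # absent names yield NA
  out[is.na(out)] <- 0     # a lemma that never occurs has frequency 0
  names(out) <- features   # restore the feature names lost on NA entries
  out
}

toy_table <- rbind(A = align_features(text_a, features),
                   B = align_features(text_b, features))
toy_table
```

A quick `anyNA(dataset1)` or `anyNA(dataset2)` would show whether the real tables are affected at all; short dialogues are the most likely candidates.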

Performing Delta

As we initially assumed, the smallest distances are found within the groups we defined as Plato 1, 2, and 3, with one exception in Set 1 that is discussed under Results.

dist.delta(dataset1)
##           Grg       Prt        R8        R9        L8
## Prt 0.8064134                                        
## R8  1.3021532 1.1781170                              
## R9  1.2147812 1.1051432 0.7033716                    
## L8  1.4715845 1.3392882 1.1485845 1.2703215          
## L9  1.5107271 1.4307145 1.1535585 1.2979888 0.8315440
dist.delta(dataset2)
##             Ly      Chrm        R2        R3        L1
## Chrm 1.0158182                                        
## R2   1.2185163 1.2087662                              
## R3   1.2521667 1.1980437 0.8092858                    
## L1   1.5086234 1.4211472 1.1068653 1.1158083          
## L2   1.4596924 1.4520931 1.0707005 1.0143156 0.6966609

The same data can be presented as a matrix (values rounded to two decimal places):

round((as.matrix(dist.delta(dataset1))), 2)
##      Grg  Prt   R8   R9   L8   L9
## Grg 0.00 0.81 1.30 1.21 1.47 1.51
## Prt 0.81 0.00 1.18 1.11 1.34 1.43
## R8  1.30 1.18 0.00 0.70 1.15 1.15
## R9  1.21 1.11 0.70 0.00 1.27 1.30
## L8  1.47 1.34 1.15 1.27 0.00 0.83
## L9  1.51 1.43 1.15 1.30 0.83 0.00
round((as.matrix(dist.delta(dataset2))), 2)
##        Ly Chrm   R2   R3   L1   L2
## Ly   0.00 1.02 1.22 1.25 1.51 1.46
## Chrm 1.02 0.00 1.21 1.20 1.42 1.45
## R2   1.22 1.21 0.00 0.81 1.11 1.07
## R3   1.25 1.20 0.81 0.00 1.12 1.01
## L1   1.51 1.42 1.11 1.12 0.00 0.70
## L2   1.46 1.45 1.07 1.01 0.70 0.00
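To make the numbers above easier to interpret, here is a small sketch of what Classic (Burrows's) Delta computes, under the assumption that it is the mean absolute difference of z-scored frequencies. The matrix `x` holds invented placeholder values, not corpus data.

```r
# Toy frequency table: 3 texts (rows) x 2 features (columns); placeholder values.
x <- rbind(A = c(1, 10), B = c(2, 20), C = c(3, 60))

# Step 1: z-score each column over all texts in the table.
z <- scale(x)

# Step 2: Delta between two texts = mean absolute difference of their z-scores.
delta_AB <- mean(abs(z["A", ] - z["B", ]))

# The same quantity for all pairs at once: Manhattan distance on z-scores,
# divided by the number of features.
delta_all <- as.matrix(dist(z, method = "manhattan")) / ncol(x)
round(delta_all, 2)
```

On this reading, a Delta of 1 means that two texts differ, on average, by one standard deviation per feature, which is why 1 is used as a rough threshold in what follows.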

Cluster Analysis

hc1 <- hclust(dist.delta(dataset1))
hcd1 <- as.dendrogram(hc1)
plot(hcd1, type = "rectangle", xlab = "Distance", horiz = TRUE)

hc2 <- hclust(dist.delta(dataset2))
hcd2 <- as.dendrogram(hc2)
plot(hcd2, type = "rectangle", xlab = "Distance", horiz = TRUE)
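hclust() uses complete linkage unless told otherwise, so the dendrograms depend on that choice. The sketch below makes the method explicit and cuts the tree into three groups; it re-enters the rounded Set 2 distances printed above by hand, so it runs without the corpus files.

```r
# Rounded Delta distances for dataset1, copied from the matrix printed above.
labs <- c("Grg", "Prt", "R8", "R9", "L8", "L9")
d1 <- matrix(c(
  0.00, 0.81, 1.30, 1.21, 1.47, 1.51,
  0.81, 0.00, 1.18, 1.11, 1.34, 1.43,
  1.30, 1.18, 0.00, 0.70, 1.15, 1.15,
  1.21, 1.11, 0.70, 0.00, 1.27, 1.30,
  1.47, 1.34, 1.15, 1.27, 0.00, 0.83,
  1.51, 1.43, 1.15, 1.30, 0.83, 0.00
), nrow = 6, byrow = TRUE, dimnames = list(labs, labs))

# Complete linkage is the hclust() default; naming it documents the choice.
hc <- hclust(as.dist(d1), method = "complete")

# Cut the tree into three clusters, one per presumed Plato.
clusters <- cutree(hc, k = 3)
clusters
```

Each pair (Grg.-Prt., R. 8-9, L. 8-9) should land in its own cluster, which is a compact check of the grouping without reading the plot.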

Results

For the pairs Grg.-Prt., R. 2-3, R. 8-9, L. 1-2, and L. 8-9, Delta is below 1. For Ly.-Chrm. it is slightly above 1 (1.02), which is comparable to the distance between L. 2 and R. 3 (1.01), two texts attributed to different Platos. Recall that our previous experiments showed some confusion between Ly. and R.: with 100 mfw and 500-word samples, 25% of the samples were classified as Plato 2; with 1000-word samples, the misclassification rate was about 15%. The computations above suggest that we need to redefine the profile of Plato 1 in Set 1.
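The comparison made in this paragraph can be checked mechanically: take the largest distance inside any group and the smallest distance between groups. The sketch uses the rounded dataset2 matrix printed earlier; the `groups` vector and the variable names are illustrative.

```r
# Rounded Delta distances for dataset2, copied from the matrix printed earlier.
labs <- c("Ly", "Chrm", "R2", "R3", "L1", "L2")
d2 <- matrix(c(
  0.00, 1.02, 1.22, 1.25, 1.51, 1.46,
  1.02, 0.00, 1.21, 1.20, 1.42, 1.45,
  1.22, 1.21, 0.00, 0.81, 1.11, 1.07,
  1.25, 1.20, 0.81, 0.00, 1.12, 1.01,
  1.51, 1.42, 1.11, 1.12, 0.00, 0.70,
  1.46, 1.45, 1.07, 1.01, 0.70, 0.00
), nrow = 6, byrow = TRUE, dimnames = list(labs, labs))

# Group membership of each text.
groups <- c(Ly = "Plato1", Chrm = "Plato1", R2 = "Plato2",
            R3 = "Plato2", L1 = "Plato3", L2 = "Plato3")

# TRUE where row and column text belong to the same group.
same_group <- outer(groups, groups, "==")
upper <- upper.tri(d2)   # look at each pair once

within_max  <- max(d2[upper & same_group])    # largest within-group distance
between_min <- min(d2[upper & !same_group])   # smallest between-group distance
c(within_max = within_max, between_min = between_min)
```

Here the largest within-group distance (Ly.-Chrm.) exceeds the smallest between-group distance (R. 3-L. 2), which is the overlap that motivates redefining Plato 1.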

Redefining Plato 1 in Set 1

dir.create("corpus2")
file.copy(c("Apology.txt", "Cleitophon.txt", "Cratylus.txt", "Crito.txt", "Euthydemus.txt", "Euthyphro.txt", "HippiasMajor.txt", "HippiasMinor.txt","Laches.txt"), "corpus2")
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
my_corpus2 <- load.corpus.and.parse(files = "all", corpus.dir = "corpus2", markup.type= "plain", corpus.lang = "Other", sampling = "no.sampling", preserve.case = FALSE, encoding = "UTF-8")

To facilitate things a bit, I shall use the same list of features we made earlier.

Apol_freq <- make.frequency.list(my_corpus2$Apology, value = TRUE)
Cleit_freq <- make.frequency.list(my_corpus2$Cleitophon, value = TRUE)
Crat_freq <- make.frequency.list(my_corpus2$Cratylus, value = TRUE)
Crit_freq <- make.frequency.list(my_corpus2$Crito, value = TRUE)
Euthd_freq <- make.frequency.list(my_corpus2$Euthydemus, value = TRUE)
Euthph_freq <- make.frequency.list(my_corpus2$Euthyphro, value = TRUE)
HiMa_freq <- make.frequency.list(my_corpus2$HippiasMajor, value = TRUE)
HiMi_freq <- make.frequency.list(my_corpus2$HippiasMinor, value = TRUE)
Lch_freq <- make.frequency.list(my_corpus2$Laches, value = TRUE)
Apol <- Apol_freq[names]
Cleit <- Cleit_freq[names]
Crat <- Crat_freq[names]
Crit <- Crit_freq[names]
Euthd <- Euthd_freq[names] 
Euthph <-  Euthph_freq[names] 
HiMa <- HiMa_freq[names]
HiMi <- HiMi_freq[names]
Lch <- Lch_freq[names]

dataset3 <- rbind(Ly, Chrm, Grg, Prt, Apol, Cleit, Crat, Crit, Euthd, Euthph, HiMa, HiMi, Lch)
round(as.matrix(dist.delta(dataset3)), 2)
##          Ly Chrm  Grg  Prt Apol Cleit Crat Crit Euthd Euthph HiMa HiMi  Lch
## Ly     0.00 0.98 1.11 1.13 1.28  1.63 1.16 1.37  0.91   1.18 1.03 1.17 1.19
## Chrm   0.98 0.00 1.00 1.01 1.15  1.51 1.11 1.27  0.81   1.22 0.94 1.28 0.99
## Grg    1.11 1.00 0.00 0.80 0.92  1.15 0.91 1.01  0.77   1.01 0.88 0.92 0.90
## Prt    1.13 1.01 0.80 0.00 0.87  1.19 0.98 1.08  0.77   1.16 0.99 1.12 0.93
## Apol   1.28 1.15 0.92 0.87 0.00  1.36 1.25 1.07  1.05   1.26 1.12 1.34 1.02
## Cleit  1.63 1.51 1.15 1.19 1.36  0.00 1.28 1.53  1.32   1.35 1.45 1.55 1.44
## Crat   1.16 1.11 0.91 0.98 1.25  1.28 0.00 1.34  0.96   1.10 0.98 1.19 1.08
## Crit   1.37 1.27 1.01 1.08 1.07  1.53 1.34 0.00  1.10   1.40 1.13 1.27 1.21
## Euthd  0.91 0.81 0.77 0.77 1.05  1.32 0.96 1.10  0.00   1.14 0.84 1.13 0.88
## Euthph 1.18 1.22 1.01 1.16 1.26  1.35 1.10 1.40  1.14   0.00 0.86 1.21 1.07
## HiMa   1.03 0.94 0.88 0.99 1.12  1.45 0.98 1.13  0.84   0.86 0.00 1.10 1.01
## HiMi   1.17 1.28 0.92 1.12 1.34  1.55 1.19 1.27  1.13   1.21 1.10 0.00 1.14
## Lch    1.19 0.99 0.90 0.93 1.02  1.44 1.08 1.21  0.88   1.07 1.01 1.14 0.00
hc3 <- hclust(dist.delta(dataset3))
hcd3 <- as.dendrogram(hc3)
plot(hcd3, type = "rectangle", xlab = "Distance", horiz = TRUE)
abline(v=1,col="red",lty=2)

While this time the Delta for Chrm. and Ly. is slightly below 1 (0.98), the dendrogram suggests that HiMa. and Euthph. are stylistically closer to each other (0.86). Let us see if we can now plug this pair into Set 1.
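The Ly.-Chrm. distance was 1.02 in dataset2 and is 0.98 in dataset3. A plausible explanation, assuming Delta is computed on frequencies z-scored over whichever texts are in the table, is that the distance between the same two texts shifts when the comparison set changes. The toy example below (placeholder numbers, hypothetical function name `classic_delta`) illustrates the effect.

```r
# Classic Delta under the stated assumption: Manhattan distance on
# column-wise z-scores, divided by the number of features.
classic_delta <- function(x) {
  as.matrix(dist(scale(x), method = "manhattan")) / ncol(x)
}

# Three toy texts, then the same three plus a fourth (placeholder values).
small <- rbind(A = c(1, 10), B = c(2, 20), C = c(3, 60))
large <- rbind(small, D = c(10, 30))

# Distance between A and B in each setting.
ab_small <- classic_delta(small)["A", "B"]
ab_large <- classic_delta(large)["A", "B"]
c(small = ab_small, large = ab_large)
```

Adding text D changes the column means and standard deviations, so the A-B distance changes although neither A nor B did. Delta values from different tables are therefore not directly comparable.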

Testing Sample Size (Set 1 New Plato 1)

dir.create("corpus3")
file.copy(c("HippiasMajor.txt", "Euthyphro.txt","Republic2.txt","Republic3.txt", "Laws1.txt","Laws2.txt"), "corpus3")
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
setwd("corpus3")
file.names <- list.files()
new.file.names <- c("Pl1_Euthyphro.txt",    "Pl1_HippiasMajor.txt", "Pl3_Laws1.txt", "Pl3_Laws2.txt", "Pl2_Republic2.txt", "Pl2_Republic3.txt")
file.rename(from = file.names, to = new.file.names)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
setwd("~/R_Workflow/2_Three_Platos_Distances")
sp3 <- size.penalize(corpus.dir = "corpus3", mfw = c(35, 70, 100), sample.size.coverage = c(500, 1000, 1500), classification.method = "delta")
sp3$accuracy.scores
## $Pl1_Euthyphro
##          500 1000 1500
## mfw_35  0.95 0.98    1
## mfw_70  0.98 0.99    1
## mfw_100 0.96 1.00    1
## 
## $Pl1_HippiasMajor
##          500 1000 1500
## mfw_35  0.97 0.99 0.99
## mfw_70  0.98 1.00 1.00
## mfw_100 0.88 0.99 0.98
## 
## $Pl2_Republic2
##          500 1000 1500
## mfw_35  0.85 0.96 0.99
## mfw_70  0.93 1.00 1.00
## mfw_100 0.94 1.00 1.00
## 
## $Pl2_Republic3
##          500 1000 1500
## mfw_35  0.75 0.87 0.90
## mfw_70  0.69 0.80 0.86
## mfw_100 0.83 0.90 0.98
## 
## $Pl3_Laws1
##          500 1000 1500
## mfw_35  0.91 0.99 1.00
## mfw_70  0.92 0.98 0.97
## mfw_100 0.97 0.99 1.00
## 
## $Pl3_Laws2
##          500 1000 1500
## mfw_35  0.87 0.98 0.99
## mfw_70  0.92 0.97 1.00
## mfw_100 0.91 0.99 1.00
## 
## attr(,"description")
## [1] "accuracy scores for the tested texts"
sp3$confusion.matrices
## $Pl1_Euthyphro
## $Pl1_Euthyphro$mfw_35
##     500 1000 1500
## Pl1  95   98  100
## Pl2   1    0    0
## Pl3   4    2    0
## 
## $Pl1_Euthyphro$mfw_70
##     500 1000 1500
## Pl1  98   99  100
## Pl2   2    1    0
## Pl3   0    0    0
## 
## $Pl1_Euthyphro$mfw_100
##     500 1000 1500
## Pl1  96  100  100
## Pl2   4    0    0
## Pl3   0    0    0
## 
## 
## $Pl1_HippiasMajor
## $Pl1_HippiasMajor$mfw_35
##     500 1000 1500
## Pl1  97   99   99
## Pl2   3    1    1
## Pl3   0    0    0
## 
## $Pl1_HippiasMajor$mfw_70
##     500 1000 1500
## Pl1  98  100  100
## Pl2   1    0    0
## Pl3   1    0    0
## 
## $Pl1_HippiasMajor$mfw_100
##     500 1000 1500
## Pl1  88   99   98
## Pl2  12    1    2
## Pl3   0    0    0
## 
## 
## $Pl2_Republic2
## $Pl2_Republic2$mfw_35
##     500 1000 1500
## Pl1   0    0    0
## Pl2  85   96   99
## Pl3  15    4    1
## 
## $Pl2_Republic2$mfw_70
##     500 1000 1500
## Pl1   1    0    0
## Pl2  93  100  100
## Pl3   6    0    0
## 
## $Pl2_Republic2$mfw_100
##     500 1000 1500
## Pl1   2    0    0
## Pl2  94  100  100
## Pl3   4    0    0
## 
## 
## $Pl2_Republic3
## $Pl2_Republic3$mfw_35
##     500 1000 1500
## Pl1   1    0    0
## Pl2  75   87   90
## Pl3  24   13   10
## 
## $Pl2_Republic3$mfw_70
##     500 1000 1500
## Pl1   0    0    0
## Pl2  69   80   86
## Pl3  31   20   14
## 
## $Pl2_Republic3$mfw_100
##     500 1000 1500
## Pl1   1    0    0
## Pl2  83   90   98
## Pl3  16   10    2
## 
## 
## $Pl3_Laws1
## $Pl3_Laws1$mfw_35
##     500 1000 1500
## Pl1   0    0    0
## Pl2   9    1    0
## Pl3  91   99  100
## 
## $Pl3_Laws1$mfw_70
##     500 1000 1500
## Pl1   0    0    0
## Pl2   8    2    3
## Pl3  92   98   97
## 
## $Pl3_Laws1$mfw_100
##     500 1000 1500
## Pl1   0    0    0
## Pl2   3    1    0
## Pl3  97   99  100
## 
## 
## $Pl3_Laws2
## $Pl3_Laws2$mfw_35
##     500 1000 1500
## Pl1   0    0    0
## Pl2  13    2    1
## Pl3  87   98   99
## 
## $Pl3_Laws2$mfw_70
##     500 1000 1500
## Pl1   0    0    0
## Pl2   8    3    0
## Pl3  92   97  100
## 
## $Pl3_Laws2$mfw_100
##     500 1000 1500
## Pl1   0    0    0
## Pl2   9    1    0
## Pl3  91   99  100
## 
## 
## attr(,"description")
## [1] "all classification scores (raw tables)"
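The accuracy scores and the confusion matrices carry the same information: each column of a confusion matrix sums to 100 classified samples, and the accuracy is the share assigned to the correct class. A short sketch with the counts for Pl2_Republic3 at 70 mfw, copied from the output above:

```r
# Counts for Pl2_Republic3 at 70 mfw, copied from the confusion matrix above.
conf_R3_mfw70 <- rbind(Pl1 = c(0, 0, 0),
                       Pl2 = c(69, 80, 86),
                       Pl3 = c(31, 20, 14))
colnames(conf_R3_mfw70) <- c("500", "1000", "1500")

# Accuracy per sample size: correct class (Pl2) over all classified samples.
accuracy <- conf_R3_mfw70["Pl2", ] / colSums(conf_R3_mfw70)
accuracy
```

The result matches the mfw_70 row reported for Pl2_Republic3 (0.69, 0.80, 0.86), and the Pl3 row shows where the misclassified samples went.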

HiMa.-Euthph. is now a clearly distinguished group, but some confusion remains between the opening books of the Laws and the Republic: with 70 mfw, for example, 31% of the 500-word samples from R. 3 are assigned to Plato 3. Let us build another dendrogram to see which books of R. and L. are stylistically most remote from each other.

Republic vs. Laws

We already have frequencies for R. 2-3, 8-9, and L. 1-2, 8-9. Let us make frequency lists for the remaining books.

dir.create("corpus4")
file.copy(c("Republic1.txt", "Republic4.txt","Republic5.txt", "Republic6.txt","Republic7.txt", "Republic10.txt", "Laws3.txt","Laws4.txt", "Laws5.txt", "Laws6.txt", "Laws7.txt", "Laws10.txt", "Laws11.txt", "Laws12.txt"), "corpus4")
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
my_corpus4 <- load.corpus.and.parse(files = "all", corpus.dir = "corpus4", markup.type= "plain", corpus.lang = "Other", sampling = "no.sampling", preserve.case = FALSE, encoding = "UTF-8")
R1_freq <- make.frequency.list(my_corpus4$Republic1, value = TRUE)
R4_freq <- make.frequency.list(my_corpus4$Republic4, value = TRUE)
R5_freq <- make.frequency.list(my_corpus4$Republic5, value = TRUE)
R6_freq <- make.frequency.list(my_corpus4$Republic6, value = TRUE)
R7_freq <- make.frequency.list(my_corpus4$Republic7, value = TRUE)
R10_freq <- make.frequency.list(my_corpus4$Republic10, value = TRUE)
L3_freq <- make.frequency.list(my_corpus4$Laws3, value = TRUE)
L4_freq <- make.frequency.list(my_corpus4$Laws4, value = TRUE)
L5_freq <- make.frequency.list(my_corpus4$Laws5, value = TRUE)
L6_freq <- make.frequency.list(my_corpus4$Laws6, value = TRUE)
L7_freq <- make.frequency.list(my_corpus4$Laws7, value = TRUE)
L10_freq <- make.frequency.list(my_corpus4$Laws10, value = TRUE)
L11_freq <- make.frequency.list(my_corpus4$Laws11, value = TRUE)
L12_freq <- make.frequency.list(my_corpus4$Laws12, value = TRUE)

To subset frequencies for mfw:

R1 <- R1_freq[names]
R4 <- R4_freq[names]
R5 <- R5_freq[names]
R6 <- R6_freq[names]
R7 <- R7_freq[names]
R10 <- R10_freq[names]
L3 <- L3_freq[names]
L4 <- L4_freq[names]
L5 <- L5_freq[names]
L6 <- L6_freq[names]
L7 <- L7_freq[names]
L10 <- L10_freq[names]
L11 <- L11_freq[names]
L12 <- L12_freq[names]
dataset4 <- rbind(R1, R2, R3, R4, R5, R6, R7, R8, R9, R10, L1, L2, L3, L4, L5, L6, L7, L8, L9, L10, L11, L12)
hc4 <- hclust(dist.delta(dataset4))
hcd4 <- as.dendrogram(hc4)
hcd4 %>% set("labels_col", value = c("navy", "magenta"), k=4) %>% 
plot(horiz = TRUE)
abline(v=1,col="purple",lty=2)

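R. 2-3 and L. 1-2 already fall into different clusters. Yet we can pick the books of the Laws that are most distant from R. 2-3 and use them to redefine Plato 3 in Set 1. In the table below, L. 11 and L. 6 have the largest distances to R. 2 and R. 3 taken together; L. 8 and L. 9 are also remote from R. 2, but they already belong to Set 2.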

m4 <- as.matrix(dist.delta(dataset4))
m4part <- m4[,2:3]
m4part
##            R2        R3
## R1  1.0079545 1.0182222
## R2  0.0000000 0.7931216
## R3  0.7931216 0.0000000
## R4  0.9444780 0.7595450
## R5  0.8203685 0.6679126
## R6  0.8354431 0.8099925
## R7  1.0144477 0.8327298
## R8  1.0336879 0.8688327
## R9  1.0643546 0.8086371
## R10 0.8554122 0.8093287
## L1  1.2031633 1.1370590
## L2  1.1273998 1.0313713
## L3  1.1429056 1.1564221
## L4  1.1631368 1.1576791
## L5  1.1994261 1.2028352
## L6  1.3995590 1.3405549
## L7  1.1867852 1.0370860
## L8  1.4223065 1.2616897
## L9  1.4156080 1.2222859
## L10 1.1526823 1.0032480
## L11 1.5041857 1.3228621
## L12 1.2993280 1.2222116
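Rather than reading the most distant books off the table by eye, the Laws rows can be ranked by their mean distance to R. 2 and R. 3. The values below are copied from the output above; averaging the two columns is one reasonable criterion among several, not the only possible one.

```r
# Distances from each book of the Laws to R2 and R3, copied from the table above.
laws_to_rep <- rbind(
  L1  = c(1.2031633, 1.1370590), L2  = c(1.1273998, 1.0313713),
  L3  = c(1.1429056, 1.1564221), L4  = c(1.1631368, 1.1576791),
  L5  = c(1.1994261, 1.2028352), L6  = c(1.3995590, 1.3405549),
  L7  = c(1.1867852, 1.0370860), L8  = c(1.4223065, 1.2616897),
  L9  = c(1.4156080, 1.2222859), L10 = c(1.1526823, 1.0032480),
  L11 = c(1.5041857, 1.3228621), L12 = c(1.2993280, 1.2222116)
)
colnames(laws_to_rep) <- c("R2", "R3")

# Mean distance to the two Republic books, largest first.
ranked <- sort(rowMeans(laws_to_rep), decreasing = TRUE)
round(head(ranked, 4), 3)
```

L. 11 and L. 6 come out on top, followed by L. 8 and L. 9.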

Redefining Plato 3 in Set 1

dir.create("corpus5")
file.copy(c("HippiasMajor.txt","Euthyphro.txt","Laws6.txt","Laws11.txt", "Republic2.txt","Republic3.txt"), "corpus5")
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
setwd("corpus5")
file.names <- list.files()
file.names
## [1] "Euthyphro.txt"    "HippiasMajor.txt" "Laws11.txt"       "Laws6.txt"       
## [5] "Republic2.txt"    "Republic3.txt"
new.file.names <- c("Pl1_Euthyphro.txt", "Pl1_HippiasMajor.txt", "Pl3_Laws11.txt", "Pl3_Laws6.txt", "Pl2_Republic2.txt", "Pl2_Republic3.txt")
file.rename(from = file.names, to = new.file.names)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
setwd("~/R_Workflow/2_Three_Platos_Distances")
sp5 <- size.penalize(corpus.dir = "corpus5", mfw = c(35, 70, 100), sample.size.coverage = c(500, 1000, 1500), classification.method = "delta")
sp5$accuracy.scores
## $Pl1_Euthyphro
##          500 1000 1500
## mfw_35  0.97 0.99    1
## mfw_70  0.99 1.00    1
## mfw_100 0.95 0.99    1
## 
## $Pl1_HippiasMajor
##          500 1000 1500
## mfw_35  0.95 0.98 0.99
## mfw_70  1.00 0.99 1.00
## mfw_100 0.84 0.88 0.98
## 
## $Pl2_Republic2
##          500 1000 1500
## mfw_35  0.98    1    1
## mfw_70  1.00    1    1
## mfw_100 1.00    1    1
## 
## $Pl2_Republic3
##          500 1000 1500
## mfw_35  0.96 1.00    1
## mfw_70  0.95 0.99    1
## mfw_100 0.96 1.00    1
## 
## $Pl3_Laws11
##         500 1000 1500
## mfw_35    1    1    1
## mfw_70    1    1    1
## mfw_100   1    1    1
## 
## $Pl3_Laws6
##          500 1000 1500
## mfw_35  0.94 0.98 0.99
## mfw_70  1.00 1.00 1.00
## mfw_100 1.00 1.00 1.00
## 
## attr(,"description")
## [1] "accuracy scores for the tested texts"
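Both classification runs rename the copied files by position, which relies on list.files() returning them in the expected alphabetical order, and both change the working directory. A more defensive sketch maps each file to its class prefix explicitly and works with full paths; it uses a temporary directory with empty placeholder files, and `demo_dir` and `prefix` are illustrative names.

```r
# Work in a throwaway directory so nothing in the real project is touched.
demo_dir <- file.path(tempdir(), "corpus_demo")
unlink(demo_dir, recursive = TRUE)
dir.create(demo_dir)

# Explicit mapping: original file name -> class prefix.
prefix <- c("Euthyphro.txt" = "Pl1", "HippiasMajor.txt" = "Pl1",
            "Republic2.txt" = "Pl2", "Republic3.txt"    = "Pl2",
            "Laws6.txt"     = "Pl3", "Laws11.txt"       = "Pl3")

# Create empty placeholder files standing in for the real texts.
old_paths <- file.path(demo_dir, names(prefix))
created <- file.create(old_paths)

# Rename each file by looking up its own prefix; no setwd() is needed.
new_paths <- file.path(demo_dir, paste0(prefix, "_", names(prefix)))
renamed <- file.rename(old_paths, new_paths)
sort(list.files(demo_dir))
```

Because each new name is derived from the old one, adding or removing a file cannot silently attach the wrong class label to a text.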

Results. Set 1 and 2 Redefined

As a result of these computations, we can redefine our Set 1 to achieve higher accuracy scores:

  • Set 1: Plato 1 (HiMa., Euthph.) vs. Plato 2 (R. 2 and 3) vs. Plato 3 (Lg. 6 and 11)

  • Set 2: Plato 1 (Prt., Grg.) vs. Plato 2 (R. 8 and 9) vs. Plato 3 (Lg. 8 and 9)