library(readr)
library(dplyr)
library(stringr)
library(tidytext)
library(stopwords)
library(ggplot2)
library(tidyverse)
library(tidylo)
library(forcats)
library(caret)
library(quanteda)
library(glmnet)
library(gbm)
library(tm)
library(SnowballC)
library(caTools)
library(randomForest)
In the previous work, I compared all poems written by Arseny Tarkovsky to the 500 “main” poems of the Silver Age. The research question, once again, is the following: “Is it legitimate to attribute Arseny Tarkovsky to the poets of the Silver Age based on the words he used?”
I re-collected the second dataset to make the comparison more serious. It now consists of 3702 poems written by 125 poets (against 160 poems by Tarkovsky) and contains an author variable. The reason I decided to include this variable is the rather peculiar selection of poems at the site used for the text mining. The picture below shows this bias:
tarko_poems <- read_csv("tarko_poetry.csv") %>% select(poetry)
silver <- read.csv("silverr.csv", encoding = "KOI8-R") %>% select(!"X")
author_freq <- silver %>% select(!"review") %>% count(author) %>% arrange(n)
silversub <- author_freq[111:126, 1:2]
ggplot(silversub, aes(reorder(author, n), n)) +
  geom_bar(stat = "identity", fill = "#9ECE9A") +
  coord_flip() +
  theme_classic() +
  xlab("") + ylab("") +
  ggtitle("The major contributors to the Silver Age dataset") +
  labs(caption = "source: https://slova.org.ru/") +
  theme(title = element_text(size = 14))
Before moving to the current work, I want to briefly highlight some obvious limitations, partly rooted in the data:
it is hard to delimit the Silver Age period. The lifetimes of the selected poets are not a proper basis for the periodization, as they vary widely! For example, Nikolai Zabolotsky died in 1958, while the best-known representatives of the Silver Age, such as Esenin and Mayakovsky, died in 1925 and 1930 respectively. Note that Tarkovsky’s lifetime therefore overlaps this expanded period, and his era-specific vocabulary (“телефон”, for instance) might also occur in the reference dataset.
not all poets and poems are included, which is due to the problem of selection (whom to include, and up to which year?). For example, the data contains members of the ОБЭРИУ group (Kharms, Vvedensky) but does not include Konstantin Vaginov.
the selected poets contribute unequally: see the picture above again.
First, I performed lemmatization, since inflected word forms significantly dilute the word frequencies:
tarko_poems$lem <- system2("mystem", c("-d", "-l", "-e cp1251", "-c"),input = tarko_poems$poetry, stdout = TRUE)
silver$lem <- system2("mystem", c("-d", "-l", "-e cp1251", "-c"),input = silver$review, stdout = TRUE)
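A quick sanity check of the lemmatizer output (this assumes the mystem binary is available on the system PATH, as required by the system2() calls above):
head(tarko_poems$poetry, 1) # the first poem as collected
head(tarko_poems$lem, 1)    # the same poem after lemmatization with mystem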
Some more manipulations: bringing the variables to the same form, joining the data, and tokenization:
tarko_poems$set <- "tarko"
silver$set <- "silver"
tarko_poems$text <- tarko_poems$poetry
silver$text <- silver$review
tarko_poems <- tarko_poems %>% select(!"poetry")
silver <- silver %>% select(-link, -review, -author)
all_poems <- silver %>% full_join(tarko_poems)
all_poems_token <- all_poems %>% unnest_tokens(lem, lem)
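A quick look at how many tokens each corpus contributes at this point (stopwords are still included here):
all_poems_token %>% count(set) # tokens per corpus before any cleaning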
At first I thought of leaving the stopwords in the datasets, because their usage might itself be distinctive (this difference would have to be proven, but it seemed to me more plausible than a difference based on punctuation: for example, it is well known that Marina Tsvetaeva used many dashes, but this cannot be generalized to a set of 126 different poets!). So removing the stopwords is obviously a step that affects the results. On the other hand, TF-IDF gives more weight to rare words, so the stopwords might as well be removed.
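Such a check of stopword usage could look roughly like this (a minimal sketch computed on the tokenized data before the stopwords are dropped below; the shares themselves are not reported here):
all_poems_token %>%
  mutate(is_stop = lem %in% stopwords("ru")) %>% # flag stopword tokens
  group_by(set) %>%
  summarise(stopword_share = mean(is_stop))      # share of stopword tokens per corpus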
All in all, I tried working with the data including the stopwords for a while and realized it was an additional burden for my laptop. So I removed them (along with punctuation, digits, and Latin-script tokens):
all_poems_token <- all_poems_token %>%
  filter(!str_detect(lem, "[[:punct:]]|[[:digit:]]|[A-Za-z]")) # drop punctuation, digits, and Latin-script tokens
rustopwords <- data.frame(words = stopwords("ru"), stringsAsFactors = FALSE)
all_poems_token <- filter(all_poems_token, !(lem %in% rustopwords$words))
As a short note, the stopwords accounted for about 127k tokens in the silver data (roughly a third). This dataset is still much larger than the one I used in the previous work, and its Zipf’s curve is smoother:
silver_token <- silver %>% unnest_tokens(lem, lem)
silver_token <- silver_token %>%
  filter(!str_detect(lem, "[[:punct:]]|[[:digit:]]|[A-Za-z]")) # same cleaning as above
silver_token <- filter(silver_token, !(lem %in% rustopwords$words))
silver_token %>%
  dplyr::count(lem, sort = TRUE) %>%
  dplyr::mutate(rank = row_number()) %>%
  ggplot(aes(rank, n)) +
  geom_line(color = "#9ECE9A", size = 1.5) +
  scale_x_log10() +
  scale_y_log10() +
  theme_classic() +
  ggtitle("Zipf's curve of the Silver Age poets") +
  xlab("") + ylab("") +
  labs(subtitle = "126 poets & 3702 poems", caption = "source: https://slova.org.ru/") +
  theme(title = element_text(size = 12))
Next, I create a document-term matrix with TF-IDF weighting and drop the words that occur in more than 98% or fewer than 2% of documents. The resulting document-feature matrix has 3,858 documents and 484 features (95.4% sparse, which is not great).
all_poems1 <- all_poems_token %>% count(text, lem) %>% cast_dfm(text, lem, n)
poems_clean = all_poems1 %>% as.dfm %>%
dfm_trim(max_docfreq=0.98, min_docfreq=0.02, docfreq_type="prop") %>%
dfm_tfidf
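The dimensions and sparsity quoted above can be checked directly on the trimmed matrix:
dim(poems_clean)      # documents x features after trimming
sparsity(poems_clean) # share of zero cells in the dfm (a quanteda helper)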
And finally, I split the data into the train and test subsets:
set.seed(42)
library(rsample)
poems_split <- initial_split(all_poems, prop=0.8, strata="set")
poems_split
## <Analysis/Assess/Total>
## <3090/772/3862>
poems_train <- training(poems_split)
table(poems_train$set)
##
## silver tarko
## 2960 130
poems_test <- testing(poems_split)
table(poems_test$set)
##
## silver tarko
## 742 30
train.dtm <- poems_clean %>% dfm_subset(docnames(all_poems1) %in% poems_train$text)
test.dtm <- poems_clean %>% dfm_subset(docnames(all_poems1) %in% poems_test$text)
trainY <- poems_train$set
testY <- poems_test$set
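Because some poems can disappear from the dfm (for example, if all of their tokens were removed during cleaning, or if identical texts collapse into a single document), the dfm and the label vectors may end up with slightly different lengths. A quick check, not part of the original pipeline, makes the mismatch that is patched below visible:
ndoc(train.dtm); length(trainY) # if these differ, some training documents are missing from the dfm
ndoc(test.dtm); length(testY)   # the same check for the test part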
I used lasso regularization for this model: it shrinks the coefficients of the less informative variables to exactly zero and keeps only the most important ones.
set.seed(42)
trainY1 <- trainY[1:3088] # the dfm and the label vector differ in length (apparently some documents dropped out during cleaning); I failed to fix this more elegantly, so this crude truncation is used here and further.
cv.lasso <- cv.glmnet(x=train.dtm, y=trainY1, alpha=1, family="binomial", type.measure="auc", nfolds = 5, standardize=FALSE)
cv.lasso$lambda.min
## [1] 0.0007283378
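The cross-validation curve behind this choice of lambda can be inspected with the standard glmnet plot:
plot(cv.lasso) # AUC across the lambda sequence; the dotted lines mark lambda.min and lambda.1se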
set.seed(42)
testY1 <- testY[1:771] # the same crude length fix as above
predicted.lasso <- as.factor(predict(cv.lasso, test.dtm, type="class")) # note: by default predict() on a cv.glmnet object uses lambda.1se, not lambda.min
cm.lasso <- confusionMatrix(data = predicted.lasso, reference = as.factor(testY1), positive="tarko")
table(predicted.lasso)
## predicted.lasso
## silver tarko
## 766 5
cm.lasso
## Confusion Matrix and Statistics
##
## Reference
## Prediction silver tarko
## silver 739 27
## tarko 3 2
##
## Accuracy : 0.9611
## 95% CI : (0.9449, 0.9736)
## No Information Rate : 0.9624
## P-Value [Acc > NIR] : 0.622
##
## Kappa : 0.1078
##
## Mcnemar's Test P-Value : 2.679e-05
##
## Sensitivity : 0.068966
## Specificity : 0.995957
## Pos Pred Value : 0.400000
## Neg Pred Value : 0.964752
## Prevalence : 0.037613
## Detection Rate : 0.002594
## Detection Prevalence : 0.006485
## Balanced Accuracy : 0.532461
##
## 'Positive' Class : tarko
##
cm.lasso$byClass
## Sensitivity Specificity Pos Pred Value
## 0.068965517 0.995956873 0.400000000
## Neg Pred Value Precision Recall
## 0.964751958 0.400000000 0.068965517
## F1 Prevalence Detection Rate
## 0.117647059 0.037613489 0.002594034
## Detection Prevalence Balanced Accuracy
## 0.006485084 0.532461195
The results show that the model performs poorly. Although the accuracy of 0.9611 looks extremely high, Kappa = 0.1078 shows that the model is only slightly better than random assignment. As for the other measures (with “tarko” as the positive class), recall = 0.07 and precision = 0.40 (if “silver” is taken as the positive class instead, both values exceed 0.96), and F1 = 0.12.
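To substantiate the parenthetical remark, the same confusion matrix can be recomputed with “silver” as the positive class (cm.silver is just a helper object):
cm.silver <- confusionMatrix(data = predicted.lasso, reference = as.factor(testY1), positive = "silver")
cm.silver$byClass[c("Precision", "Recall", "F1")] # precision and recall both exceed 0.96 for the silver class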
coef(cv.lasso, cv.lasso$lambda.min) %>%
as.matrix %>% as.data.frame %>%
tibble::rownames_to_column() %>%
arrange(-abs(`1`)) %>% head(10)
## rowname 1
## 1 (Intercept) -2.9925281
## 2 май -1.0938285
## 3 поле -1.0791111
## 4 солнце -1.0526163
## 5 темный -1.0007440
## 6 туча -0.9018710
## 7 январь -0.8550315
## 8 земля 0.8291637
## 9 взор -0.7860103
## 10 лишь -0.7736142
The most important coefficients are presented above. 8 words out of 9 are negative, which means they are relevant for predicting the Silver Age poems. One word (“земля”) is significant for the Tarkovsky poems - from the previous laboratory work I know that its frequency in his corpus is the third largest (not counting the stopwords).
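A reminder of the sign convention: a two-class glmnet model describes the log-odds of the second factor level of the response, so negative coefficients pull a poem towards “silver” and positive ones towards “tarko”.
levels(as.factor(trainY1)) # "silver" is the first (reference) level, "tarko" is the modelled one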
poems_split1 <- as.data.frame(poems_split) # the training part of the split
errors <- all_poems %>% anti_join(poems_split1) # what remains is (roughly) the test set
errors_tarko <- errors %>% filter(predicted.lasso == "tarko" & testY1 == "silver") # silver poems labelled as Tarkovsky
errors_silver <- errors %>% filter(predicted.lasso == "silver" & testY1 == "tarko") # Tarkovsky poems labelled as silver
errors_silver[1,3]
## [1] "До сих пор мне было невдомек — Для чего мне звездный каталог? В каталоге десять миллионов Номеров небесных телефонов, Десять миллионов номеров Телефонов марев и миров, Полный свод свеченья и мерцанья, Список абонентов мирозданья. Я-то знаю, как зовут звезду, Я и телефон ее найду, Пережду я очередь земную, Поверну я азбуку стальную: — А-13-40-25. Я не знаю, где тебя искать. Запоет мембрана телефона: — Отвечает альфа Ориона. Я в дороге, я теперь звезда, Я тебя забыла навсегда. Я звезда — денницына сестрица, Я тебе не захочу присниться, До тебя мне дела больше нет. Позвони мне через триста лет."
errors_tarko[1,3]
## [1] " Ива с дубом, мечтая, росли у пруда... Дуб тянулся всё к небу прекрасному, Где веселые звезды под вечер всегда Зажигались светить миру страстному. Ива вниз наклонялась к зеленым струям, К тем струям, что светила несметные Отражали в себе по осенним ночам Да журчали сказанья приветные. Не дорос дуб до тверди небес голубой, Ива ветки зеленые выгнула И коснулась воды, и омылась водой, А горящих светил не достигнула. Август 1882 "
An obvious criticism of this model concerns the unequal sizes of the two datasets, of course. Because of that, there are many more Tarkovsky poems misclassified as silver (27) than silver poems misclassified as Tarkovsky (3): the model simply knows more words related to the silver poems and therefore labels Tarkovsky’s poems built from relatively rare words as Silver Age poems. The first poem quoted above is an example of such a misclassification. Interestingly, the word “телефон”, which is relatively frequent in Tarkovsky’s corpus, did not help the model make the right choice.
As for the opposite case (a Silver Age poem attributed to Tarkovsky), I have a fairly convincing explanation for the second quoted poem: it is full of Tarkovsky’s frequent words that I identified in the previous work: “вода”, “ночь”, “свет”, and perhaps the colour terms.
Here I briefly try to overcome the problem of size and randomly select 160 observations from the reference set, just to show that the issue is indeed about size. The number 160 was chosen because it is the exact size of the Tarkovsky corpus.
set.seed(42)
sub <- silver[sample(c(1:3702), 160, replace = TRUE),] # note: sampling with replacement, so a few poems may appear twice
tarko_and_sub <- sub %>% full_join(tarko_poems)
tarkosub_token <- tarko_and_sub %>% unnest_tokens(lem, lem)
tarkosub_token <- tarkosub_token %>% filter(!str_detect(lem, "[[:punct:]]|[[:digit:]]|[A-Za-z]")) # same cleaning as for the full data
tarkosubb_token <- filter(tarkosub_token, !(lem %in% stopwords("ru")))
Creating a document-term matrix with TF-IDF weighting and dropping the words that occur in more than 98% or fewer than 2% of documents. The resulting document-feature matrix has 313 documents and 481 features (95.4% sparse, which again is not great).
tarko_and_sub1 <- tarkosubb_token %>% count(text, lem) %>% cast_dfm(text, lem, n)
tarkosub_clean = tarko_and_sub1 %>% as.dfm %>%
dfm_trim(max_docfreq=0.98, min_docfreq=0.02, docfreq_type="prop") %>%
dfm_tfidf
Splitting data:
set.seed(42)
library(rsample)
tarkosub_split <- initial_split(tarko_and_sub, prop=0.8, strata="set")
tarkosub_split
## <Analysis/Assess/Total>
## <258/62/320>
tarkosub_train <- training(tarkosub_split)
table(tarkosub_train$set)
##
## silver tarko
## 129 129
tarkosub_test <- testing(tarkosub_split)
table(tarkosub_test$set)
##
## silver tarko
## 31 31
train.dtm <- tarkosub_clean %>% dfm_subset(docnames(tarko_and_sub1) %in% tarkosub_train$text)
test.dtm <- tarkosub_clean %>% dfm_subset(docnames(tarko_and_sub1) %in% tarkosub_test$text)
trainY <- tarkosub_train$set
testY <- tarkosub_test$set
Model:
set.seed(42)
trainY <- trainY[1:253] # the same length-mismatch fix as before: the dfm has fewer documents than the training labels
cv.lasso <- cv.glmnet(x=train.dtm, y=trainY, alpha=1, family="binomial", type.measure="auc", nfolds = 5, standardize=FALSE)
cv.lasso$lambda.min
## [1] 0.003712717
set.seed(42)
predicted.lasso <- as.factor(predict(cv.lasso, test.dtm, type="class"))
table(predicted.lasso)
## predicted.lasso
## silver tarko
## 25 37
cm.lasso <- confusionMatrix(data = predicted.lasso, reference = as.factor(testY), positive="tarko")
cm.lasso
## Confusion Matrix and Statistics
##
## Reference
## Prediction silver tarko
## silver 19 6
## tarko 12 25
##
## Accuracy : 0.7097
## 95% CI : (0.5805, 0.818)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 0.0006495
##
## Kappa : 0.4194
##
## Mcnemar's Test P-Value : 0.2385928
##
## Sensitivity : 0.8065
## Specificity : 0.6129
## Pos Pred Value : 0.6757
## Neg Pred Value : 0.7600
## Prevalence : 0.5000
## Detection Rate : 0.4032
## Detection Prevalence : 0.5968
## Balanced Accuracy : 0.7097
##
## 'Positive' Class : tarko
##
cm.lasso$byClass
## Sensitivity Specificity Pos Pred Value
## 0.8064516 0.6129032 0.6756757
## Neg Pred Value Precision Recall
## 0.7600000 0.6756757 0.8064516
## F1 Prevalence Detection Rate
## 0.7352941 0.5000000 0.4032258
## Detection Prevalence Balanced Accuracy
## 0.5967742 0.7096774
The classification quality indicators are much better here. Firstly, the counts are distributed more evenly across the confusion matrix cells. Further, although the accuracy is only about 0.71, the other indexes (recall = 0.81, precision = 0.68, F1 = 0.74, and kappa = 0.42) show that the model is.. fair (?).
Words and coefficients for the subset (there are many overlaps with the previous table! Maybe that one was not so bad after all):
coef(cv.lasso, cv.lasso$lambda.min) %>%
as.matrix %>% as.data.frame %>%
tibble::rownames_to_column() %>%
arrange(-abs(`1`)) %>% head(20)
## rowname 1
## 1 солнце -1.3599981
## 2 тонкий -1.2990794
## 3 взор -1.2714516
## 4 даль -1.2630106
## 5 дух -1.1009027
## 6 судьба 1.0200812
## 7 апрель -0.9978484
## 8 мечта -0.9670253
## 9 пламень -0.9135135
## 10 дремать -0.9070201
## 11 время 0.8828687
## 12 маленький -0.8795499
## 13 миг -0.8622339
## 14 весна -0.8469621
## 15 мост 0.8397353
## 16 нога 0.8106568
## 17 поток -0.8042511
## 18 строгий -0.8039094
## 19 хороший 0.7731458
## 20 день -0.7631981
Two additional comments on this part, which occurred to me when it was too late to change the report:
it would be great to use 100 or even more random subsamples and average the resulting statistics (see the sketch after this list);
an even better idea is to divide the large Silver Age sample into approximately 10 distinct corpora based on the poets’ groups: acmeists, futurists, imagists, and so on. I built 5 models like the one above and the accuracy varied from 0.65 to 0.88; the reason for that, I guess, lies in the different shares of the different groups. But this is to be checked in the future.
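To make the first point concrete, here is a rough sketch of how repeated subsampling could be wired up. The helper subsample_accuracy is hypothetical: it rebuilds a plain count dfm without the tf-idf trimming used above and samples without replacement, so it only illustrates the idea rather than reproducing the exact pipeline.
subsample_accuracy <- function(silver_df, tarko_df, seed) {
  set.seed(seed)
  sub_b <- silver_df[sample(nrow(silver_df), 160), ]              # a fresh 160-poem reference subsample
  both <- bind_rows(sub_b, tarko_df)
  tokens <- both %>%
    unnest_tokens(lem, lem) %>%
    filter(!str_detect(lem, "[[:punct:]]|[[:digit:]]|[A-Za-z]")) %>%
    filter(!(lem %in% stopwords("ru")))
  dfm_b <- tokens %>% count(text, lem) %>% cast_dfm(text, lem, n) # plain count dfm
  labels <- both$set[match(docnames(dfm_b), both$text)]           # line the labels up with the dfm documents
  train_i <- sample(ndoc(dfm_b), round(0.8 * ndoc(dfm_b)))
  fit <- cv.glmnet(dfm_b[train_i, ], labels[train_i], family = "binomial", alpha = 1, nfolds = 5)
  pred <- predict(fit, dfm_b[-train_i, ], type = "class", s = "lambda.min")
  mean(pred == labels[-train_i])                                  # test accuracy for this subsample
}
accs <- sapply(1:10, function(b) subsample_accuracy(silver, tarko_poems, seed = b))
mean(accs); sd(accs) # average accuracy and its spread across the subsamples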
My computer was not able to run this method (random forest) on the full data, so I decided to use the subset from the previous section (yes, not representative at all).
newmatrix <- as.matrix(tarkosub_clean)
colnames(newmatrix) = make.names(colnames(newmatrix)) # make the feature names syntactically valid for the formula interface used below
Some more manipulations and the baseline level of accuracy:
newmatrix <- as.data.frame(newmatrix)
tarko_and_sub <- tarko_and_sub[1:313, 1:3] # align with the 313 documents left in the dfm (the same crude length fix as before)
newmatrix$set <- tarko_and_sub$set
prop.table(table(newmatrix$set))
##
## silver tarko
## 0.5111821 0.4888179
Splitting data:
set.seed(42)
split = sample.split(newmatrix$set, SplitRatio = 0.8)
trainmatrix = subset(newmatrix, split==TRUE)
testmatrix = subset(newmatrix, split==FALSE)
Running a method:
set.seed(42)
trainmatrix$set = as.factor(trainmatrix$set)
testmatrix$set = as.factor(testmatrix$set)
RF_model = randomForest(set ~ ., data=trainmatrix)
predictRF = predict(RF_model, newdata=testmatrix)
table(testmatrix$set, predictRF)
## predictRF
## silver tarko
## silver 28 4
## tarko 8 23
summary(testmatrix$set)
## silver tarko
## 32 31
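Note that in table(testmatrix$set, predictRF) the rows are the actual classes and the columns the predictions, so accuracy can be read straight off the diagonal (conf_tab is just a helper name):
conf_tab <- table(testmatrix$set, predictRF)
sum(diag(conf_tab)) / sum(conf_tab) # (28 + 23) / 63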
A funny detail: I got a bit confused at first and started to calculate the quality indicators by hand (with “tarko” as the positive class):
paste("Accuracy = ", round((28+33)/(28+33+4+8),2))
## [1] "Accuracy = 0.84"
paste("Precision = ", round(28/(4+28), 2))
## [1] "Precision = 0.88"
paste("Recall = ", round((28/(28+8)), 2))
## [1] "Recall = 0.78"
precision = 28/(4+28)
recall = 28/(28+8)
paste("F1 =", round((2*precision*recall/(precision + recall)),2))
## [1] "F1 = 0.82"
Pobserved = (28+23)/(28+4+8+23)
Pexpected = ((28+4)*(28+8)/63 + (8+23)*(4+23)/63)/63
paste("kappa =", round((Pobserved - Pexpected)/(1 - Pexpected),2))
## [1] "kappa = 0.62"
Checking with confusionMatrix():
accuracytest = confusionMatrix(predictRF, testmatrix$set, positive = "tarko")
accuracytest
## Confusion Matrix and Statistics
##
## Reference
## Prediction silver tarko
## silver 28 8
## tarko 4 23
##
## Accuracy : 0.8095
## 95% CI : (0.6909, 0.8975)
## No Information Rate : 0.5079
## P-Value [Acc > NIR] : 6.935e-07
##
## Kappa : 0.6182
##
## Mcnemar's Test P-Value : 0.3865
##
## Sensitivity : 0.7419
## Specificity : 0.8750
## Pos Pred Value : 0.8519
## Neg Pred Value : 0.7778
## Prevalence : 0.4921
## Detection Rate : 0.3651
## Detection Prevalence : 0.4286
## Balanced Accuracy : 0.8085
##
## 'Positive' Class : tarko
##
accuracytest$byClass
## Sensitivity Specificity Pos Pred Value
## 0.7419355 0.8750000 0.8518519
## Neg Pred Value Precision Recall
## 0.7777778 0.8518519 0.7419355
## F1 Prevalence Detection Rate
## 0.7931034 0.4920635 0.3650794
## Detection Prevalence Balanced Accuracy
## 0.4285714 0.8084677
The model clearly outperformed the baseline (the majority-class share of about 0.51); its quality indicators are also better than those of the previous model built with lasso.
imp <- importance(RF_model) %>% as.matrix %>% as.data.frame %>% tibble::rownames_to_column()
imp <- imp %>% arrange(MeanDecreaseGini)
imp[462:481, 1:2]
## rowname MeanDecreaseGini
## 462 это 0.8392288
## 463 весь 0.8460858
## 464 день 0.8547533
## 465 нежный 0.8794229
## 466 судьба 0.9323990
## 467 солнце 0.9487465
## 468 земля 0.9605318
## 469 пламень 0.9623212
## 470 нога 0.9668824
## 471 небо 1.0240643
## 472 миг 1.0414371
## 473 время 1.1626597
## 474 любовь 1.2008082
## 475 горло 1.2009002
## 476 даль 1.2130359
## 477 сердце 1.2827029
## 478 твой 1.3571021
## 479 мечта 1.4045078
## 480 рука 1.7283319
## 481 взор 1.8062685
This list overlaps considerably with the previous ones: “солнце”, “мечта”, “взор”, etc. A general observation (which holds for the other lists as well): their tops consist of distinctly “poetic” words. Also, as the plot below shows, two words - “взор” and “рука” - stand apart from the others.
varImpPlot(RF_model, sort = TRUE, n.var = 20, main = "RF_model classifier & its main variables", frame.plot = FALSE, pt.cex = 1.5, bg = "#9ECE9A", lcolor = "#9ECE9A")
testpred <- predict(RF_model, newdata=testmatrix) %>% as.data.frame()
colnames(testpred) <- "predicted" # give the single prediction column a readable name
testpred$actual <- testmatrix$set
testpred$text <- rownames(testpred)
rownames(testpred) <- c(1:63)
error_1 <- testpred %>% filter(predicted == "silver", actual == "tarko") # Tarkovsky poems labelled as silver
error_2 <- testpred %>% filter(predicted == "tarko", actual == "silver") # silver poems labelled as Tarkovsky
error_1[7,3]
## [1] "Стол накрыт на шестерых — Розы да хрусталь… А среди гостей моих — Горе да печаль. И со мною мой отец, И со мною брат. Час проходит. Наконец У дверей стучат. Как двенадцать лет назад, Холодна рука, И немодные шумят Синие шелка. И вино поет из тьмы, И звенит стекло: «Как тебя любили мы, Сколько лет прошло». Улыбнется мне отец, Брат нальет вина, Даст мне руку без колец, Скажет мне она: «Каблучки мои в пыли, Выцвела коса, И звучат из-под земли Наши голоса»."
error_2[1,3]
## [1] " В смешную ванну падал друг Стена кружилася вокруг Корова чудная плыла Над домом улица была И друг мелькая на песке Ходил по комнатам в носке Вертя как фокусник рукой То левой, а потом другой Потом кидался на постель Когда в болотах коростель Чирикал шапочкой и выл Уже мой друг не в ванне был. 5 марта 1927 "
It was still not easy to understand the model’s logic behind each mistake, but I found some cues in the poems shown above. The first one, ascribed to the Silver Age, is built from infrequent words; still, a person (me, at least) could guess that it was written by Arseny from the specific word “каблучки”, which also appears in another of his poems, “Я боюсь, что слишком поздно”:
“<…> И в моей ночи ревнивой Каблучки твои стучат <…>”
As for the second poem, written by Daniil Kharms but ascribed to Tarkovsky, it contains many words that are frequent in Tarkovsky’s corpus: “дом”, “друг”, etc.
The random forest classifier performed better than the first one, with a reservation about the data used: again, it now seems to me that it would be more interesting and fairer to compare Tarkovsky to the individual groups of the Silver Age. Obvious limitations of the second method are (1) the large model size and (2) a possible bias in favor of higher-frequency words.
All in all, it was an insightful work. Thank you!