Binary Classifiers for the Silver Age & Tarkovsky Poems

Libraries, Data, and Limitations

library(readr)
library(dplyr)
library(stringr)
library(tidytext)
library(stopwords)
library(ggplot2)
library(tidyverse)
library(tidylo)
library(forcats)
library(caret)
library(quanteda)
library(glmnet)
library(gbm)
library(tm)
library(SnowballC)
library(caTools)
library(randomForest)

In the previous work, I compared all poems written by Arseny Tarkovsky to the 500 “main” poems of the Silver Age. Once again, the same research question is posed: “Is it legitimate to attribute Arseny Tarkovsky to the poets of the Silver Age based on the words he used?”.

For this work I re-collected the second dataset to make the comparison more rigorous. It now consists of 3702 poems written by 126 poets (against 160 of Tarkovsky’s poems) and contains an author variable. I decided to include that variable because of the rather peculiar selection of poems at the site used for the text mining. The picture below shows this bias:

tarko_poems <- read_csv("tarko_poetry.csv") %>% select(poetry)
silver <- read.csv("silverr.csv", encoding = "KOI8-R") %>% select(!"X")


author_freq <- silver %>% select(!"review") %>% count(author) %>% arrange(n)
silversub <- author_freq[111:126, 1:2]

ggplot(silversub, aes(reorder(author, n), n)) +
  geom_bar(stat = "identity", fill = "#9ECE9A") +
  coord_flip() +
  theme_classic() + xlab("") + ylab("") +
  ggtitle("The major contributors to the Silver Age dataset") + labs(caption = "source: https://slova.org.ru/") + theme(title = element_text(size = 14))

Before moving to the current work, I want to briefly highlight some obvious limitations, partly rooted in the data:

  1. it is hard to delimit the Silver Age period. The lifetimes of the selected poets are not a proper basis for the periodization, as they vary a lot! For example, Nikolai Zabolotsky died in 1958, while the best-known representatives of the Silver Age like Esenin or Mayakovsky died in 1925 and 1930 respectively. Note that Tarkovsky’s lifetime thus intersects this expanded period, and his time-specific vocabulary (“телефон”, for instance) might also occur in the reference dataset.

  2. not all poets and poems are included, which is due to the problem of selection (who, and up to which year?). For example, the data contains members of the ОБЭРИУ group (Kharms, Vvedensky) but does not include Konstantin Vaginov.

  3. selected poets contribute differently: see again the picture above.

Data Manipulations

First, I performed lemmatization, since otherwise the inflected forms of a word fragment its frequency counts significantly:

# lemmatization with Yandex mystem, called as an external binary:
# -d applies disambiguation, -l prints only lemmas, -e sets the encoding, -c copies non-word input
tarko_poems$lem <- system2("mystem", c("-d", "-l", "-e cp1251", "-c"), input = tarko_poems$poetry, stdout = TRUE)

silver$lem <- system2("mystem", c("-d", "-l", "-e cp1251", "-c"), input = silver$review, stdout = TRUE)

Some more manipulations: converting variables to the same forms, joining data, and tokenization:

tarko_poems$set <- "tarko"
silver$set <- "silver"
tarko_poems$text <- tarko_poems$poetry
silver$text <- silver$review
tarko_poems <- tarko_poems %>% select(!"poetry")
silver <- silver %>% select(!"link" & !"review" & !"author")

all_poems <- silver %>% full_join(tarko_poems)

all_poems_token <- all_poems %>% unnest_tokens(lem, lem)

At first I thought of leaving the stopwords in the datasets because their usage might be distinctive (this difference would have to be proven, but it seemed more plausible to me than a difference based on punctuation: for example, it is well known that Marina Tsvetaeva used many dashes, but that cannot be generalized to a set of 126 different poets!). So removing the stopwords is obviously a step that affects the results. On the other hand, TF-IDF gives more weight to rare words, so stopwords may safely be removed.
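
To illustrate the TF-IDF point, here is a tiny toy example with made-up counts (not taken from the corpora): a word that occurs in every document gets an idf of zero and therefore no weight, while a rarer word does.

# Toy illustration of TF-IDF down-weighting ubiquitous words (hypothetical counts)
toy <- tibble::tribble(
  ~doc,    ~word,  ~n,
  "poem1", "и",    10,   # appears in both documents -> idf = log(2/2) = 0
  "poem2", "и",     8,
  "poem1", "взор",  2    # appears in one document only -> idf = log(2/1) > 0
)
toy %>% bind_tf_idf(word, doc, n)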

All in all, I tried working with the data including stopwords for a while and realized it was an extra burden on my laptop. So I removed them (along with punctuation, digits, and Latin-letter tokens):

all_poems_token = all_poems_token %>%
  filter(!str_detect(lem, "[[:punct:]]|[[:digit:]]|[a-z]"))

all_poems_token = filter(all_poems_token, !(lem %in% stopwords("ru")))

As a short note, the stopwords accounted for about 127k tokens in the silver data (around a third). This dataset is still much larger than the one I used in the previous work, and its Zipf’s curve is smoother:
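
The 127k figure is easy to reproduce with a quick count over the raw lemmatized tokens, before any filtering (a rough sketch):

# Share of Russian stopwords among the raw tokens of the Silver Age corpus
silver_raw <- silver %>% unnest_tokens(lem, lem)
sum(silver_raw$lem %in% stopwords("ru"))    # ~127k stopword tokens
mean(silver_raw$lem %in% stopwords("ru"))   # roughly a third of all tokens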

silver_token <- silver %>% unnest_tokens(lem, lem)
silver_token = silver_token %>%
  filter(!str_detect(lem, "[[:punct:]]|[[:digit:]]|[a-z]"))
silver_token = filter(silver_token,!(lem %in% c(stopwords("ru"))))

silver_token %>%
    dplyr::count(lem, sort = TRUE) %>%
    dplyr::mutate(rank = row_number()) %>%
    ggplot(aes(rank, n)) +
    geom_line(color = "#9ECE9A", size = 1.5) +
    scale_x_log10() +
    scale_y_log10() +
    theme_classic() +
    ggtitle("Zipf's curve of the Silver Age poets") +
    xlab("") + ylab("") +
    labs(subtitle = "126 poets & 3702 poems", caption = "source: https://slova.org.ru/") +
    theme(title = element_text(size = 12))

Next, I create a document-term matrix with TF-IDF weighting and drop the words in the top and bottom 2% of document frequency. The resulting document-feature matrix has 3,858 documents and 484 features and is 95.4% sparse, which is not great.

all_poems1 <- all_poems_token %>% count(text, lem) %>% cast_dfm(text, lem, n) 

poems_clean = all_poems1 %>% as.dfm %>%
          dfm_trim(max_docfreq=0.98, min_docfreq=0.02, docfreq_type="prop") %>%
          dfm_tfidf
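
The size and sparsity quoted above can be read directly from the trimmed matrix with quanteda helpers:

# Sanity check of the trimmed, TF-IDF-weighted matrix
dim(poems_clean)       # 3858 documents x 484 features
sparsity(poems_clean)  # ~0.954, the share of zero cells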

And finally, I split the data into the train and test subsets:

set.seed(42)
library(rsample)
poems_split  <- initial_split(all_poems, prop=0.8, strata="set")
poems_split
## <Analysis/Assess/Total>
## <3090/772/3862>
poems_train <- training(poems_split)
table(poems_train$set)
## 
## silver  tarko 
##   2960    130
poems_test <- testing(poems_split)
table(poems_test$set)
## 
## silver  tarko 
##    742     30
train.dtm  <- poems_clean %>% dfm_subset(docnames(all_poems1) %in% poems_train$text)
test.dtm  <- poems_clean %>% dfm_subset(docnames(all_poems1) %in% poems_test$text)

trainY <- poems_train$set
testY <- poems_test$set

Logistic Regression Classifier with Lasso Regularization - part 1

I used lasso regularization for this model: it shrinks the coefficients of the less informative variables to exactly 0 and keeps only the most significant ones.

set.seed(42)
trainY1 <- trainY[1:3088] # train.dtm has fewer documents than poems_train (duplicate texts collapse in the dfm); I failed to fix the mismatch properly, so the labels are simply truncated here and further below.

cv.lasso <- cv.glmnet(x=train.dtm, y=trainY1, alpha=1, family="binomial", type.measure="auc", nfolds = 5, standardize=FALSE)
cv.lasso$lambda.min
## [1] 0.0007283378
set.seed(42)
testY1 <- testY[1:771]
predicted.lasso <- as.factor(predict(cv.lasso, test.dtm, type="class"))
cm.lasso <- confusionMatrix(data = predicted.lasso, reference = as.factor(testY1), positive="tarko")

table(predicted.lasso)
## predicted.lasso
## silver  tarko 
##    766      5
cm.lasso
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction silver tarko
##     silver    739    27
##     tarko       3     2
##                                           
##                Accuracy : 0.9611          
##                  95% CI : (0.9449, 0.9736)
##     No Information Rate : 0.9624          
##     P-Value [Acc > NIR] : 0.622           
##                                           
##                   Kappa : 0.1078          
##                                           
##  Mcnemar's Test P-Value : 2.679e-05       
##                                           
##             Sensitivity : 0.068966        
##             Specificity : 0.995957        
##          Pos Pred Value : 0.400000        
##          Neg Pred Value : 0.964752        
##              Prevalence : 0.037613        
##          Detection Rate : 0.002594        
##    Detection Prevalence : 0.006485        
##       Balanced Accuracy : 0.532461        
##                                           
##        'Positive' Class : tarko           
## 
cm.lasso$byClass
##          Sensitivity          Specificity       Pos Pred Value 
##          0.068965517          0.995956873          0.400000000 
##       Neg Pred Value            Precision               Recall 
##          0.964751958          0.400000000          0.068965517 
##                   F1           Prevalence       Detection Rate 
##          0.117647059          0.037613489          0.002594034 
## Detection Prevalence    Balanced Accuracy 
##          0.006485084          0.532461195

The results show that the model performs poorly. Although the accuracy of 0.9611 is extremely high, Kappa = 0.1078 shows that the model is only slightly better than random assignment. As for the other measures (with “tarko” as the positive class), precision = 0.40, recall = 0.069, and F1 = 0.12; if “silver” is taken as the positive class instead, the recall is 0.996, which only reflects the class imbalance.
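
For transparency, the Kappa value can be reproduced by hand from the confusion matrix above (observed agreement vs. agreement expected by chance):

# Cohen's kappa by hand from the confusion matrix above
# (rows = predictions, columns = reference: 739/27 in the "silver" row, 3/2 in the "tarko" row)
n  <- 739 + 27 + 3 + 2                                       # 771 test documents
po <- (739 + 2) / n                                          # observed agreement = accuracy, ~0.961
pe <- ((739 + 27) * (739 + 3) + (3 + 2) * (27 + 2)) / n^2    # chance agreement, ~0.956
(po - pe) / (1 - pe)                                         # ~0.108 - barely better than chance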

coef(cv.lasso, cv.lasso$lambda.min) %>%
    as.matrix %>% as.data.frame %>%
    tibble::rownames_to_column() %>%
    arrange(-abs(`1`)) %>% head(10)
##        rowname          1
## 1  (Intercept) -2.9925281
## 2          май -1.0938285
## 3         поле -1.0791111
## 4       солнце -1.0526163
## 5       темный -1.0007440
## 6         туча -0.9018710
## 7       январь -0.8550315
## 8        земля  0.8291637
## 9         взор -0.7860103
## 10        лишь -0.7736142

The most important coefficients are presented above. 8 words out of 9 have negative coefficients, which means they pull the prediction towards the Silver Age class. The only positive word (“земля”) is indicative of the Tarkovsky poems - from the previous laboratory work I know that its frequency in his corpus is the third largest (not counting the stopwords).
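
As a side note on how selective the lasso actually was here, one can count how many of the 484 features kept a non-zero coefficient at lambda.min (a quick check on the fitted cv.lasso object):

# Number of features with non-zero coefficients at lambda.min (intercept excluded)
betas <- coef(cv.lasso, s = cv.lasso$lambda.min)
sum(betas != 0) - 1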

poems_split1 <- as.data.frame(poems_split)          # the analysis (training) part of the split
errors <- all_poems %>% anti_join(poems_split1)     # everything outside the training part ~ the test poems


errors_tarko <- errors %>% filter(predicted.lasso == "tarko" & testY1 == "silver")   # false positives
errors_silver <- errors %>% filter(predicted.lasso == "silver" & testY1 == "tarko")  # false negatives

errors_silver[1,3]
## [1] "До сих пор мне было невдомек — Для чего мне звездный каталог? В каталоге десять миллионов Номеров небесных телефонов, Десять миллионов номеров Телефонов марев и миров, Полный свод свеченья и мерцанья, Список абонентов мирозданья. Я-то знаю, как зовут звезду, Я и телефон ее найду, Пережду я очередь земную, Поверну я азбуку стальную: — А-13-40-25. Я не знаю, где тебя искать. Запоет мембрана телефона: — Отвечает альфа Ориона. Я в дороге, я теперь звезда, Я тебя забыла навсегда. Я звезда — денницына сестрица, Я тебе не захочу присниться, До тебя мне дела больше нет. Позвони мне через триста лет."
errors_tarko[1,3]
## [1] " Ива с дубом, мечтая, росли у пруда... Дуб тянулся всё к небу прекрасному, Где веселые звезды под вечер всегда Зажигались светить миру страстному.  Ива вниз наклонялась к зеленым струям, К тем струям, что светила несметные Отражали в себе по осенним ночам Да журчали сказанья приветные.  Не дорос дуб до тверди небес голубой, Ива ветки зеленые выгнула И коснулась воды, и омылась водой, А горящих светил не достигнула.  Август 1882 "

An obvious criticism of this model concerns the unequal sizes of the datasets. Because of that, there are far more false negatives (27 Tarkovsky poems labeled as silver) than false positives (3): the model simply knows more words related to the silver poems and therefore labels Tarkovsky poems built from relatively rare words as Silver Age poems. The first poem above is an example of such a misclassification. Interestingly, the word “телефон”, which is relatively frequent in Tarkovsky’s corpus, did not help the model make the right choice.

As for the false positives, I have a fairly complete explanation for the second poem shown: it is full of the frequent Tarkovsky words I identified in the last work: “вода”, “ночь”, “свет”, and maybe colors.

Logistic Regression Classifier with Lasso Regularization - part 2

Here I briefly try to overcome the size problem and randomly select 160 observations from the reference set, just to show that the issue is largely one of size. The size of 160 was chosen because it is the exact size of the Tarkovsky corpus.

set.seed(42)
sub <- silver[sample(c(1:3702), 160, replace = TRUE),]   # note: sampled with replacement, so duplicates are possible
tarko_and_sub <- sub %>% full_join(tarko_poems)
tarkosub_token <- tarko_and_sub %>% unnest_tokens(lem, lem)
tarkosub_token <- tarkosub_token %>% filter(!str_detect(lem, "[[:punct:]]|[[:digit:]]|[a-z]"))
tarkosubb_token = filter(tarkosub_token, !(lem %in% stopwords("ru")))

Creating a document-term matrix with TF-IDF and dropping the top and bottom 2% of words: the resulting document-feature matrix has 313 documents and 481 features (95.4% sparse, which again is not great).

tarko_and_sub1 <- tarkosubb_token %>% count(text, lem) %>% cast_dfm(text, lem, n) 

tarkosub_clean = tarko_and_sub1 %>% as.dfm %>%
          dfm_trim(max_docfreq=0.98, min_docfreq=0.02, docfreq_type="prop") %>%
          dfm_tfidf

Splitting data:

set.seed(42)
library(rsample)
tarkosub_split  <- initial_split(tarko_and_sub, prop=0.8, strata="set")
tarkosub_split
## <Analysis/Assess/Total>
## <258/62/320>
tarkosub_train <- training(tarkosub_split)
table(tarkosub_train$set)
## 
## silver  tarko 
##    129    129
tarkosub_test <- testing(tarkosub_split)
table(tarkosub_test$set)
## 
## silver  tarko 
##     31     31
train.dtm  <- tarkosub_clean %>% dfm_subset(docnames(tarko_and_sub1) %in% tarkosub_train$text)
test.dtm  <- tarkosub_clean %>% dfm_subset(docnames(tarko_and_sub1) %in% tarkosub_test$text)

trainY <- tarkosub_train$set
testY <- tarkosub_test$set

Model:

set.seed(42)
trainY <- trainY[1:253]
cv.lasso <- cv.glmnet(x=train.dtm, y=trainY, alpha=1, family="binomial", type.measure="auc", nfolds = 5, standardize=FALSE)
cv.lasso$lambda.min
## [1] 0.003712717
set.seed(42)
predicted.lasso <- as.factor(predict(cv.lasso, test.dtm, type="class"))

table(predicted.lasso)
## predicted.lasso
## silver  tarko 
##     25     37
cm.lasso <- confusionMatrix(data = predicted.lasso, reference = as.factor(testY), positive="tarko")
cm.lasso
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction silver tarko
##     silver     19     6
##     tarko      12    25
##                                          
##                Accuracy : 0.7097         
##                  95% CI : (0.5805, 0.818)
##     No Information Rate : 0.5            
##     P-Value [Acc > NIR] : 0.0006495      
##                                          
##                   Kappa : 0.4194         
##                                          
##  Mcnemar's Test P-Value : 0.2385928      
##                                          
##             Sensitivity : 0.8065         
##             Specificity : 0.6129         
##          Pos Pred Value : 0.6757         
##          Neg Pred Value : 0.7600         
##              Prevalence : 0.5000         
##          Detection Rate : 0.4032         
##    Detection Prevalence : 0.5968         
##       Balanced Accuracy : 0.7097         
##                                          
##        'Positive' Class : tarko          
## 
cm.lasso$byClass
##          Sensitivity          Specificity       Pos Pred Value 
##            0.8064516            0.6129032            0.6756757 
##       Neg Pred Value            Precision               Recall 
##            0.7600000            0.6756757            0.8064516 
##                   F1           Prevalence       Detection Rate 
##            0.7352941            0.5000000            0.4032258 
## Detection Prevalence    Balanced Accuracy 
##            0.5967742            0.7096774

The classification quality indicators are much better here. First, the distribution among the confusion matrix cells is more even. Further, although the accuracy is only 0.71, the other indexes - precision = 0.68, recall = 0.81, F1 = 0.74, and kappa = 0.42 - show that the model is… fair (?).

Words & coefficients for the subset (there are many intersections with the previous table! maybe, that one was not that bad):

coef(cv.lasso, cv.lasso$lambda.min) %>%
    as.matrix %>% as.data.frame %>%
    tibble::rownames_to_column() %>%
    arrange(-abs(`1`)) %>% head(20)
##      rowname          1
## 1     солнце -1.3599981
## 2     тонкий -1.2990794
## 3       взор -1.2714516
## 4       даль -1.2630106
## 5        дух -1.1009027
## 6     судьба  1.0200812
## 7     апрель -0.9978484
## 8      мечта -0.9670253
## 9    пламень -0.9135135
## 10   дремать -0.9070201
## 11     время  0.8828687
## 12 маленький -0.8795499
## 13       миг -0.8622339
## 14     весна -0.8469621
## 15      мост  0.8397353
## 16      нога  0.8106568
## 17     поток -0.8042511
## 18   строгий -0.8039094
## 19   хороший  0.7731458
## 20      день -0.7631981
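
The overlap can also be checked programmatically rather than by eye. A hypothetical sketch, assuming the two printed coefficient tables had been saved as coefs_full and coefs_sub before cv.lasso was overwritten (those object names are mine):

# Hypothetical: quantify the overlap between the two models' most important words.
# coefs_full and coefs_sub are assumed to hold the two coefficient tables shown above.
top_full <- coefs_full %>% filter(rowname != "(Intercept)") %>% slice_head(n = 20) %>% pull(rowname)
top_sub  <- coefs_sub  %>% filter(rowname != "(Intercept)") %>% slice_head(n = 20) %>% pull(rowname)
intersect(top_full, top_sub)   # shared words, e.g. "солнце" and "взор" appear in both tables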

Two additional comments on this part that I thought of when it was too late to change the report:

  1. it would be great to use 100 or even more random subsamples and average the statistics over them (see the sketch after this list);

  2. an even better idea is to divide the large Silver Age sample into approximately 10 distinct corpora based on the poets’ groups: acmeists, futurists, imagists and so on. I created 5 models like the one above and the accuracy varied from 0.65 to 0.88; the reason for that (as I guess) lies in the different shares of the different groups. But this is to be checked in the future.
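
Here is a sketch of the first idea, under the assumption that it simply repeats the preprocessing and the lasso fit shown above inside a loop (n_reps and the intermediate names are mine; I did not run this for the report):

# Sketch of comment (1): average the test accuracy over many random 160-poem subsamples
set.seed(42)
n_reps <- 100
accs <- replicate(n_reps, {
  sub_i  <- silver[sample(1:3702, 160), ]
  both   <- sub_i %>% full_join(tarko_poems)
  tokens <- both %>%
    unnest_tokens(lem, lem) %>%
    filter(!str_detect(lem, "[[:punct:]]|[[:digit:]]|[a-z]")) %>%
    filter(!(lem %in% stopwords("ru")))
  dfm_i <- tokens %>% count(text, lem) %>% cast_dfm(text, lem, n) %>%
    dfm_trim(max_docfreq = 0.98, min_docfreq = 0.02, docfreq_type = "prop") %>%
    dfm_tfidf()
  split_i <- initial_split(both, prop = 0.8, strata = "set")
  train_i <- training(split_i)
  test_i  <- testing(split_i)
  train.dtm <- dfm_subset(dfm_i, docnames(dfm_i) %in% train_i$text)
  test.dtm  <- dfm_subset(dfm_i, docnames(dfm_i) %in% test_i$text)
  # match labels to the dfm documents by text, instead of truncating them
  fit  <- cv.glmnet(train.dtm, train_i$set[match(docnames(train.dtm), train_i$text)],
                    alpha = 1, family = "binomial", nfolds = 5, standardize = FALSE)
  pred <- predict(fit, test.dtm, type = "class")
  mean(pred == test_i$set[match(docnames(test.dtm), test_i$text)])
})
mean(accs)  # average accuracy over the subsamples
sd(accs)    # and its spread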

Random Forest

My computer was not able to run this method on the total data, so I decided to use the subset from the previous section (yeah, not representative at all).

newmatrix <- as.matrix(tarkosub_clean)
colnames(newmatrix) = make.names(colnames(newmatrix))

Some more manipulations and the baseline level of accuracy:

newmatrix <- as.data.frame(newmatrix)
tarko_and_sub <- tarko_and_sub[1:313, 1:3]   # crude alignment with the 313 documents left in the dfm
newmatrix$set <- tarko_and_sub$set
prop.table(table(newmatrix$set))
## 
##    silver     tarko 
## 0.5111821 0.4888179

Splitting data:

set.seed(42)
split = sample.split(newmatrix$set, SplitRatio = 0.8)
trainmatrix = subset(newmatrix, split==TRUE)
testmatrix = subset(newmatrix, split==FALSE)

Running a method:

set.seed(42)
trainmatrix$set = as.factor(trainmatrix$set)
testmatrix$set = as.factor(testmatrix$set)
 
RF_model = randomForest(set ~ ., data=trainmatrix)
predictRF = predict(RF_model, newdata=testmatrix)
table(testmatrix$set, predictRF)
##         predictRF
##          silver tarko
##   silver     28     4
##   tarko       8    23
summary(testmatrix$set)
## silver  tarko 
##     32     31

A funny detail: I was a bit confused at first and started to calculate the quality indicators by hand:

paste("Accuracy = ", round((28+33)/(28+33+4+8),2))
## [1] "Accuracy =  0.84"
paste("Precision = ", round(28/(4+28), 2))
## [1] "Precision =  0.88"
paste("Recall = ", round((28/(28+8)), 2))
## [1] "Recall =  0.78"
precision = 28/(4+28)
recall = 28/(28+8)

paste("F1 =", round((2*precision*recall/(precision + recall)),2))
## [1] "F1 = 0.82"
Pobserved = (28+23)/(28+4+8+23)
Pexpected = ((28+4)*(28+8)/63 + (8+23)*(4+23)/63)/63

paste("kappa =", round((Pobserved - Pexpected)/(1 - Pexpected),2))
## [1] "kappa = 0.62"

Checking out:

accuracytest = confusionMatrix(predictRF, testmatrix$set, positive = "tarko")
accuracytest
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction silver tarko
##     silver     28     8
##     tarko       4    23
##                                           
##                Accuracy : 0.8095          
##                  95% CI : (0.6909, 0.8975)
##     No Information Rate : 0.5079          
##     P-Value [Acc > NIR] : 6.935e-07       
##                                           
##                   Kappa : 0.6182          
##                                           
##  Mcnemar's Test P-Value : 0.3865          
##                                           
##             Sensitivity : 0.7419          
##             Specificity : 0.8750          
##          Pos Pred Value : 0.8519          
##          Neg Pred Value : 0.7778          
##              Prevalence : 0.4921          
##          Detection Rate : 0.3651          
##    Detection Prevalence : 0.4286          
##       Balanced Accuracy : 0.8085          
##                                           
##        'Positive' Class : tarko           
## 
accuracytest$byClass
##          Sensitivity          Specificity       Pos Pred Value 
##            0.7419355            0.8750000            0.8518519 
##       Neg Pred Value            Precision               Recall 
##            0.7777778            0.8518519            0.7419355 
##                   F1           Prevalence       Detection Rate 
##            0.7931034            0.4920635            0.3650794 
## Detection Prevalence    Balanced Accuracy 
##            0.4285714            0.8084677

The model outperformed the baseline; its quality indicators are also better than those of the previous model built with lasso.
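
To make “outperformed the baseline” concrete: the baseline here is simply predicting the majority class of the test set (the No Information Rate in the caret output), and the random forest accuracy can be read from the same object:

# Baseline = always predict the majority class of the test set (caret's "No Information Rate")
max(prop.table(table(testmatrix$set)))  # ~0.51
accuracytest$overall["Accuracy"]        # ~0.81, clearly above the baseline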

imp <- importance(RF_model) %>% as.matrix %>% as.data.frame %>% tibble::rownames_to_column()
imp <- imp %>% arrange(MeanDecreaseGini)
imp[462:481, 1:2]
##     rowname MeanDecreaseGini
## 462     это        0.8392288
## 463    весь        0.8460858
## 464    день        0.8547533
## 465  нежный        0.8794229
## 466  судьба        0.9323990
## 467  солнце        0.9487465
## 468   земля        0.9605318
## 469 пламень        0.9623212
## 470    нога        0.9668824
## 471    небо        1.0240643
## 472     миг        1.0414371
## 473   время        1.1626597
## 474  любовь        1.2008082
## 475   горло        1.2009002
## 476    даль        1.2130359
## 477  сердце        1.2827029
## 478    твой        1.3571021
## 479   мечта        1.4045078
## 480    рука        1.7283319
## 481    взор        1.8062685

This list again overlaps a lot with the earlier ones: “солнце”, “мечта”, “взор”, etc. A general comment (which applies to the other lists as well): their tops consist of really “poetic” words. Also, as the plot below shows, two words - “взор” and “рука” - stand apart from the others.

varImpPlot(RF_model, sort = TRUE, n.var = 20, main = "RF_model classifier & its main variables", frame.plot = FALSE, pt.cex = 1.5, bg = "#9ECE9A", lcolor = "#9ECE9A")

testpred <- predict(RF_model, newdata=testmatrix) %>% as.data.frame()
testpred$actual <- testmatrix$set
testpred$text <- rownames(testpred)
rownames(testpred) <- c(1:63)

error_1 <- testpred %>% filter(. == "silver", actual == "tarko")   # Tarkovsky poems labeled as Silver Age
error_2 <- testpred %>% filter(. == "tarko", actual == "silver")   # Silver Age poems labeled as Tarkovsky

error_1[7,3]
## [1] "Стол накрыт на шестерых — Розы да хрусталь… А среди гостей моих — Горе да печаль. И со мною мой отец, И со мною брат. Час проходит. Наконец У дверей стучат. Как двенадцать лет назад, Холодна рука, И немодные шумят Синие шелка. И вино поет из тьмы, И звенит стекло: «Как тебя любили мы, Сколько лет прошло». Улыбнется мне отец, Брат нальет вина, Даст мне руку без колец, Скажет мне она: «Каблучки мои в пыли, Выцвела коса, И звучат из-под земли Наши голоса»."
error_2[1,3]
## [1] " В смешную ванну падал друг Стена кружилася вокруг Корова чудная плыла Над домом улица была И друг мелькая на песке Ходил по комнатам в носке Вертя как фокусник рукой То левой, а потом другой Потом кидался на постель Когда в болотах коростель Чирикал шапочкой и выл Уже мой друг не в ванне был.  5 марта 1927 "

It was still not easy to understand the model’s logic behind each mistake, but I found some cues in the poems shown above. The first one, ascribed to the Silver Age, is built from rather infrequent words - yet a person (me, at least) can guess it was written by Arseny from the specific word “каблучки”, which appears in another of his poems, “Я боюсь, что слишком поздно”:

“<…> И в моей ночи ревнивой Каблучки твои стучат <…>”

As for the second poem, written by Daniil Kharms but ascribed to Tarkovsky, it contains many words that are frequent in Tarkovsky’s corpus: “дом”, “друг”, etc.

Some final comments

The random forest classifier performed better than the first one, with a reservation about the data used - again, it now seems to me that it would be more interesting and fair to compare Tarkovsky to the individual groups of the Silver Age. Obvious limitations of the second method are (1) the large model size and (2) a bias in favor of higher-frequency words (?).

All in all, it was an insightful work. Thank you!