Poetic Topic Modeling using LDA: themes of Silver Age & Soviet times

Intro, Libraries, and Data

In two previous lab papers, I analyzed a Silver Age poetry corpus (500 and 3,702 observations, respectively) and Arseny Tarkovsky’s poetry corpus (170 observations) to understand whether it is fair to call Tarkovsky “the last poet of the Silver Age”. Here, I compare the most distinguishable topics in Silver Age & Soviet poetry. By “Soviet” I mean the Soviet period itself: Tarkovsky’s life spanned almost the entire 20th century (1907-1989), and his poems date from many different decades. The idea is therefore to show that his contemporaries differed from their predecessors.

Although the answer seems intuitive (WW2 & communist ideology look like the most influential historical factors), this step is a necessary part of the investigation.

# data wrangling
library(readr)
library(dplyr)
library(tidyverse)
library(tidylo)

# visualization
library(ggplot2)
library(kableExtra)
library(cowplot)

# dealing with text
library(tidytext)
library(stopwords)
library(stringr)

# topic model
options(java.parameters = "-Xmx1g")
library(rJava)
library(mallet)
library(LDAvis)
library(servr)


# data
silver <- read.csv("silver_new.csv", encoding = "KOI-8")
soviet <- read.csv("soviet_new.csv", encoding = "KOI-8")

Before starting this work, I collected even larger corpora than before. I parsed poems from culture.ru on the assumption that “bigger means better”. Since the site has its own system of tags (such as “love” and “friendship” and, most importantly for me, “Silver Age” and “Soviet”), I believed it would be relatively easy to use as a source (I comment later on the risks of relying on someone else’s category assignments).

Overall, the Silver Age dataset grew by around 10,000 units! That looked great at first, but some problems surfaced in the process. I’ll mention them briefly to make the report complete.

silver %>%
  count(author) %>%
  mutate(author = fct_reorder(author, n)) %>%
  top_n(20, n) %>%
  ggplot(aes(author, n)) +
  geom_bar(stat = "identity", width = 0.7, fill = "#9ECE9A") +
  coord_flip() +
  labs(title = "20 major contributors to the Silver Age dataset,",
       subtitle = "selected from 71 poets in total",
       x = "", y = "",
       caption = "source: https://www.culture.ru/literature/poems/tag-serebryanyi-vek") +
  theme_bw() +
  geom_text(aes(label = n), check_overlap = TRUE, position = position_stack(0.5), size = 3.5)

The representation of each writer still depends strongly on the data source. In the corpus I used earlier, for example, Marina Tsvetaeva was the leader (249 poems out of 3,702). In the new data, I was surprised to see Valery Bryusov and Igor Severyanin on top. The poet I expected to see at least among the top 20 contributors is Vladimir Mayakovsky: he wrote a lot before and during the 1920s as one of the founders of “ЛЕФ” (LEF), yet he does not appear in either ranking. My hypothesis is that he was excluded from both canons: as the “трубадур революции” (“troubadour of the Revolution”) and as a figure the Revolution no longer needed by the late 1920s.

Other poets excluded from the Soviet canon, and missing from the website altogether, are Dmitry Prigov and Alexey Parshchikov. The list could be continued.

soviet %>%
  count(author, tarko) %>%
  mutate(author = fct_reorder(author, n)) %>%
  top_n(20, n) %>%
  ggplot(aes(author, n, fill = tarko)) +
  geom_bar(stat = "identity", width = 0.7) +
  coord_flip() +
  labs(title = "20 major contributors to the Soviet dataset,",
       subtitle = "selected from 72 poets in total",
       x = "", y = "",
       caption = "source: https://www.culture.ru/literature/poems/tag-sovetskie") +
  theme_bw() +
  geom_text(aes(label = n), check_overlap = TRUE, position = position_stack(0.5), size = 4) +
  scale_fill_manual(values = c("#E74D4D", "#9ECE9A")) +
  theme(legend.position = "none")

It is impossible to understand not only how poets were included but also how poems were selected. Some authors (like Vladimir Vysotsky) seem to be included in full, while the work of others was heavily filtered. The funniest example I found is that the only “Soviet” poem by Osip Mandelstam is “Ода Сталину” (“Ode to Stalin”).

Having noticed this strong disparity in both collections, I also found repeated poems. This is another problem with the site: many poems are repeated with slight changes in their titles (title, alternative title, first line; I suppose this makes searching easier for the user, and Yandex has a short article about poetry search) or in their text bodies (a different dash symbol or a few extra dots). It was not easy, but I removed these duplicates.
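
The de-duplication code itself is not part of this report. A minimal sketch of one way such near-duplicates could be caught, assuming a data frame with `title` and `poetry` columns as in the code further down: build a comparison key from the poem text (lowercased, with everything except letters and digits removed) and keep the first poem per key. The helper name `drop_near_duplicates` and the toy data are hypothetical.

```r
# Hypothetical helper: drops poems whose text differs only in case,
# punctuation (dash variants, extra dots) or whitespace.
drop_near_duplicates <- function(poems) {
  # keep only letters and digits; perl = TRUE enables the Unicode classes
  key <- gsub("[^\\p{L}\\p{N}]+", "", poems$poetry, perl = TRUE)
  key <- tolower(key)
  poems[!duplicated(key), , drop = FALSE]
}

# Toy data (not from the corpus): rows 1 and 2 are the same poem
toy <- data.frame(
  title  = c("The sea", "The sea at night", "Another poem"),
  poetry = c("The sea - at night.", "The sea \u2014 at night...", "Another poem"),
  stringsAsFactors = FALSE
)
deduped <- drop_near_duplicates(toy)
```

Keying on the text body rather than the title sidesteps the title variants, but it would not catch poems whose bodies differ by a whole word or line.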

My general and final note about poetry text mining is that it is tricky. Probably the solution is to collect each poet separately, from web pages dedicated to individual poets.

Some more words about the data & pre-processing

The tags mentioned above work as follows: one poem = one tag. That is strange, and it excludes many possible observations (why can a poem tagged “О любви” (“About love”) not also be tagged “Серебряный век” (“Silver Age”)?). In any case, this oddity is a limitation of the analysis rooted in the nature of the data source.

Despite the one-tag rule, the corpora intersect: (1) some poets appear in both (Severyanin, Mayakovsky, Aseev, Kharms), and (2) some inheritance of traditions and styles can be identified (for instance, Mikhail Svetlov in the Soviet data is a well-known successor of Mayakovsky). So, for all its oddities, the corpus captures historical & cultural processes and links the successive eras.

Thinking about pre-processing, I read an article about topic modeling of Spanish Golden Age poetry. As pre-processing is one of the parameters that determine the resulting topics, the author ran several tests and found that the best performance was achieved with stop-word filtering and no lemmatization. I decided to follow the same procedure and additionally remove some noise. Of course, Spanish and Russian belong to different branches of the Indo-European family (Romance and Slavic), so this result should be treated with caution.

silver$period <- "silver"
soviet$period <- "soviet"
soviet <- soviet %>% select(author, title, poetry, period) %>% filter(author != "Арсений Тарковский")
all_poems <- soviet %>% full_join(silver)

One thing I realized after removing Latin letters is that many poets used them in specific words, so it might have been wiser to remove only particular tags.

silver_tokens <- silver %>% unnest_tokens(token, poetry)
soviet_tokens <- soviet %>% unnest_tokens(token, poetry)
rustopwords = data.frame(words=stopwords("ru"), stringsAsFactors=FALSE)

# drop tokens that contain punctuation, digits or Latin letters
# ("[A-Za-z]" replaces a malformed character class; the intent stated in the
# text is to remove Latin letters)
noise <- "[[:punct:]]|[[:digit:]]|[A-Za-z]"

soviet_tokens <- soviet_tokens %>%
  filter(!str_detect(token, noise)) %>%
  filter(!(token %in% stopwords("ru")))

silver_tokens <- silver_tokens %>%
  filter(!str_detect(token, noise)) %>%
  filter(!(token %in% stopwords("ru")))

# strip digits, punctuation and Latin letters from the raw texts
all_poems <- all_poems %>%
  mutate(poetry = str_remove_all(poetry, "[0-9]"),
         poetry = str_remove_all(poetry, "[[:punct:]]"),
         poetry = str_remove_all(poetry, "[A-Za-z]"))

one <- silver_tokens %>% count(token) %>% top_n(15, n) %>% ggplot(aes(x = reorder(token, n), y = n)) + geom_col(fill = "#9ECE9A") + coord_flip() + labs(title = "Silver Age top-words,", subtitle = "On the basis of 897,454 words", x = "", y = "") + theme_bw()

two <- soviet_tokens %>% count(token) %>% top_n(15, n) %>% ggplot(aes(x = reorder(token, n), y = n)) + geom_col(fill = "#E74D4D") + coord_flip() + labs(title = "Soviet period top-words,", subtitle = "On the basis of 503,335 words", x = "", y = "") + theme_bw()

plot_grid(one, two, nrow = 1, ncol = 2)

Before launching a model, I was curious about the most frequent words in both datasets and whether any of them would be missing from the combined frequency list. In fact, most of the words appear in all 3 frequency lists, in a slightly different order. A funny thing I noticed is how the particle ль was removed together with the particle б.

all_tokens <- all_poems %>% unnest_tokens(token, poetry)

all_tokens %>% count(token) %>% top_n(15, n) %>% ggplot(aes(x = reorder(token, n), y = n)) + geom_col(fill = "#6874E8") + coord_flip() + labs(title = "Silver Age & Soviet poetry top-words,", subtitle = "On the basis of 1,400,789 words", x = "", y = "") + theme_bw()

Another parameter to consider is text length. Summary statistics are given in the table below:

# words per poem (split on single spaces, as elsewhere in this report)
n_words <- function(x) sapply(strsplit(x, " "), length)
silver_len <- n_words(silver$poetry)
soviet_len <- n_words(soviet$poetry)

# every vector follows the order of `period`: Silver Age first, then Soviet
"period" <- c("Silver Age", "Soviet period")
"mean number of words" <- c(round(mean(silver_len), 2), round(mean(soviet_len), 2))
"lowest number of words" <- c(min(silver_len), min(soviet_len))
"biggest number of words" <- c(max(silver_len), max(soviet_len))


data.frame(period, `mean number of words`, `lowest number of words`, `biggest number of words`) %>% kable %>% kable_styling(bootstrap_options=c("bordered", "responsive","striped"), full_width = FALSE)

period	mean.number.of.words	lowest.number.of.words	biggest.number.of.words
Silver Age	105.51	4	7178
Soviet period	144.93	4	7434

The distribution broadens the picture:

all_poems$length <- sapply(strsplit(all_poems$poetry, " "), length)

all_poems %>%
  count(length, period) %>%
  mutate(rank = row_number()) %>%
  ggplot(aes(rank, n)) +
  geom_line(aes(color = period), size = 0.5) +
  scale_color_manual(values = c("#E74D4D", "#9ECE9A")) +
  theme_bw() +
  labs(title = "The distribution of poems' lengths in words,",
       subtitle = "19,040 poems in total",
       x = "number of words", y = "number of poems", color = "Period") +
  theme(legend.key = element_rect(fill = "#E7EAEE", color = NA))

Given the difference between sample sizes (897,454 Silver Age words versus 503,335 Soviet words), the difference in mode is to be expected. It was probably necessary to re-divide & combine the long and short poems somehow, but my experiments failed: the topics became less identifiable. Maybe this is also a feature of poems as material for topic modeling: poems are “aggregated expressions”, and it is hard to change their original composition.

I decided to drop poems shorter than 25 words. Some extreme examples of short poems:
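
The re-division experiment is not shown in this report. Purely as an illustration, here is one simple way to split a long poem into consecutive pieces of at most `n` words. The function name `split_poem` and the chunk size are assumptions, not the procedure actually tried here.

```r
# Hypothetical helper: split one text into consecutive chunks of at most n words.
split_poem <- function(text, n = 100) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  words <- words[nzchar(words)]          # drop empty strings from repeated spaces
  if (length(words) == 0) return(character(0))
  chunk_id <- ceiling(seq_along(words) / n)
  unname(vapply(split(words, chunk_id), paste, character(1), collapse = " "))
}

# Toy example: 7 words in chunks of 3 give pieces of 3, 3 and 1 words
pieces <- split_poem("one two three four five six seven", n = 3)
```

A fixed word count ignores stanza boundaries, which may be one reason why chunked poems produce less identifiable topics.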

“О закрой свои бледные ноги.” by Valery Bryusov,
“Кто с утра сегодня пьян? Лев Суреныч Кочарян!” by Vladimir Vysotsky.

As can be seen, these poems clearly cover particular topics, but it is hard to find other short poems that would not affect these topics.

subset1 <- all_poems %>% filter(length < 25)
subset1 %>% count(period) %>% kable %>% kable_styling(bootstrap_options=c("bordered", "responsive","striped"), full_width = FALSE)

period	n
silver	453
soviet	306

Again, the difference between the numbers of poems in each group is explained by the overall difference in corpus size.

all_poems <- all_poems %>% anti_join(subset1)

subset1 %>% count(author, period) %>% mutate(author = fct_reorder(author, n)) %>% top_n(10, n) %>% ggplot(aes(author, n, fill = period)) + geom_bar(stat = "identity") + coord_flip() + scale_fill_manual(values = c("#9ECE9A", "#E74D4D")) + theme_bw() + labs(title = "10 poets with the largest number of dropped poems", subtitle = "", x = "", y = "")

Finally, the topic modeling starts here. Data now consists of 18,281 poems.

LDA

Model

First, I add an order variable and let mallet collect the required information:

all_poems <- all_poems %>% mutate(sent_id = row_number())

mallet.instances <- mallet.import(id.array=as.character(all_poems$sent_id),
                                  text.array=all_poems$poetry,
                                  stoplist.file="stopwords.txt")
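
The file `stopwords.txt` is not created anywhere in the code shown. A plausible way to produce it, assuming it simply holds the same Russian stop-word list used above with one word per line (the helper `write_stoplist` is hypothetical):

```r
# Hypothetical helper: write a stop-word vector to a file, one word per line.
write_stoplist <- function(words, path) {
  words <- unique(words[nzchar(words)])
  con <- file(path, open = "w", encoding = "UTF-8")
  on.exit(close(con))
  writeLines(words, con)
  invisible(path)
}

# Assumed usage in this report (requires the stopwords package):
# write_stoplist(stopwords::stopwords("ru"), "stopwords.txt")

# Self-contained check with a temporary file
tmp <- write_stoplist(c("a", "b", "a"), tempfile(fileext = ".txt"))
stoplist <- readLines(tmp)
```

Keeping the stop list in one place would also guarantee that the frequency plots and the topic model filter exactly the same words.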

As for the number of topics, I made about 8 trials with values ranging between 10 and 100. I still have a feeling that the top value is not the limit: some topics still had mixed top words. To put this in a broader context, in the article on Spanish poetry mentioned above this parameter is equal to 100. The author described the number of topics as a “value in a range between 50 and 100” and filtered out 19 resulting LDA topics classified as “noise” (in the end, 81 topics were considered).

As for my data, I am not sure whether 100 topics are enough:

my sample size is much larger than the one used by Borja Navarro-Colorado;
my sample combines 2 distinct (historically, culturally, politically, etc.) eras,

so I needed either to increase the number of topics or to look at the corpora separately. On the other hand, I found some logic behind only 15 topics (“грубые мазки”, i.e. “broad strokes”) and decided to go in this direction as the easiest (though less insightful) way.

For the hyperparameter optimization interval & the burn-in iterations, I used the default values:

topic.model <- MalletLDA(num.topics=15)
topic.model$model$setRandomSeed(123L)
topic.model$loadDocuments(mallet.instances) 
topic.model$setAlphaOptimization(20, 50)

Constructing the corpus dictionary & counting words:

vocabulary <- topic.model$getVocabulary()
word.freqs <- mallet.word.freqs(topic.model)

Model training with the default values as well:

topic.model$train(500)
topic.model$maximize(10)

Looking at the topics

topic.model$model$setRandomSeed(123L)
doc.topics <- mallet.doc.topics(topic.model, smoothed=TRUE, normalized=TRUE)
topic.words <- mallet.topic.words(topic.model, smoothed=TRUE, normalized=TRUE)
topic.labels <- mallet.topic.labels(topic.model, topic.words, 10)

for (k in 1:nrow(topic.words)) {
    top <- paste(mallet.top.words(topic.model, topic.words[k,], 10)$words,collapse=" ")
    cat(paste(k, top, "\n"))
}

## 1 коня конь поле князь кони русь крови ворон вся коней 
## 2 царь бог кровь дух камень меч века крови землю прах 
## 3 ветер солнце море небо ночь словно ночи сквозь небе звезды 
## 4 ах весна цветы сирень сад розы девушка роза вся две 
## 5 день мир час лишь свой жизни всё нам свет вновь 
## 6 глаза руки ночь шел долго видел помню лицо дом это 
## 7 нам наш день весь дело стал пять мол ещё брат 
## 8 мама это говорит дом спать сидит кошка кот девочка мальчик 
## 9 u e въ россии d онъ m s души власть 
## 10 снег город снега снегу мимо сквозь вдоль весь зима стук 
## 11 мир земли бог человека дух плоть вселенной стал среди огонь 
## 12 нам наш день наши пусть бой нами земли войны земле 
## 13 это всё нам люди лет лишь б просто время свете 
## 14 тебе сердце твой пусть любовь знаю друг моей тобой любви 
## 15 царь поэт сей всем свой король стих шкура пушкин ваш

top.docs <- function(doc.topics, topic, docs, top.n=10) {
    head(docs[order(-doc.topics[,topic])], top.n)
}

Here we are. I classified this list as follows:

1 - An unexpected topic related to horses as a means of transport. This word is clearly the main one, and it appears in different contexts (from riding along a city street to riding into battle). For example, a text characteristic of this topic is “Песнь о Евпатии Коловрате” by Esenin.

2 - about peasant life, including working the land, faith, the Motherland, and all the bitterness and adversity. The example comes from Voloshin:

“Как земледел над грудой веских зерен, Отобранных к осеннему посеву, Склоняется, обеими руками Зачерпывая их, и весит в горсти, Чуя Их дух, их теплоту и волю к жизни, И крестит их, — так я, склонясь над Русью, Крещу ее — от лба до поясницы, От правого до левого плеча: И, наклонясь, коленопреклоненно Целую средоточье всех путей — Москву.”

3, 10 - nature and its beauty, from lying on the forest grass to the skies. Mostly, it is about the forest and the night.

4 - spring! And everything related to it: love, various flowers (“васильки”, “фиалки”, “сирень”, etc.), birdsong, and even months (April, May)! Excellent examples are “Поэма весны” by Nikolay Zabolotsky and “Был май” by Igor Severyanin.

5 - hard to interpret, maybe nature or love (judging by the texts of the poems) or something “philosophical”.

6 - leisure: tea drinking, walking in the garden, etc.

7 - labor: the texts contain many references to work that should be done by somebody (usually named by occupation: “шофер”, “столяр”, “товарищи ученые”).

8 - the place of residence, including its inhabitants (animals, relatives, small children) and its location. Remarkably, it includes Samuil Marshak’s translation of “The House that Jack Built”.

9 - hard to interpret; also, something went wrong with these single letters.

11 - religion and mysticism, from Christianity to something more exotic. That’s funny. One more example from Voloshin:

“На отмели Незнаемого моря Синдбад-скиталец подобрал бутылку, Заклепанную Соломоновой печатью, И, вскрыв ее, внезапно впал во власть В ней замкнутого яростного Джинна.”

12 - state and its symbols: “Русь”, “Война”, “Ленин”, “Кремль”.

13 - growing up and “being a good person”: this group is mostly Soviet; the examples are “Парня спасем, парня в детдом” and “Здравствуйте…” by Vladimir Vysotsky.

14 - finally, love lyrics, addressed to a partner or to a child.

15 - warfare, descriptions of fights or fighters. The last example comes from Demyan Bedny:

“Латыш хорош без аттестации. Таков он есть, таким он был: Не надо долгой агитации, Чтоб в нем зажечь геройский пыл. Скажи: «барон!» И, словно бешеный, Латыш дерется, всё круша.”

Interactive visualization

figure 1

The picture shows a model with 15 LDA topics. I tried to extract something valuable from the location of the topics (for example, I still wonder why topics 9-15 are so close), but noticed nothing. As for the word list, it is fun to play with, but I preferred to use top.docs(doc.topics, 1, all_poems$poetry) to test my assumptions.

Topics & periods

The next table compares the shares of topics between the two periods. This measure is the simplest of all, but it still describes how the data relates to the topics:

topics15 <- as.data.frame(doc.topics)
topics15 <- topics15 %>% mutate(sent_id = row_number())
topics15 <- topics15 %>% full_join(all_poems)

silver_topics <- topics15 %>% filter(period == "silver")
soviet_topics <- topics15 %>% filter(period == "soviet")

"Topic" <- c("horses", "peasant life", "nature & its beauty", "spring", "incomprehensible",  "leisure", "labor", "place of residence", "incomprehensible", "nature & its beauty", "religion & mysticism", "state & its metaphors", "being a good person", "love lyrics", "warfare")

# mean topic proportion per period: 12715 Silver Age and 5566 Soviet poems
topic_cols <- paste0("V", 1:15)
"Silver Age ratio" <- unname(round(colSums(silver_topics[topic_cols]) / 12715, 2))

"Soviet times ratio" <- unname(round(colSums(soviet_topics[topic_cols]) / 5566, 2))

data.frame(`Topic`, `Silver Age ratio`, `Soviet times ratio`) %>% kable %>% kable_styling(bootstrap_options=c("bordered", "responsive","striped"), full_width = FALSE)

Topic	Silver.Age.ratio	Soviet.times.ratio
horses	0.02	0.02
peasant life	0.04	0.01
nature & its beauty	0.17	0.12
spring	0.04	0.02
incomprehensible	0.25	0.07
leisure	0.07	0.07
labor	0.02	0.08
place of residence	0.03	0.06
incomprehensible	0.01	0.01
nature & its beauty	0.03	0.05
religion & mysticism	0.02	0.01
state & its metaphors	0.04	0.09
being a good person	0.06	0.19
love lyrics	0.18	0.16
warfare	0.02	0.02
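
Each ratio in this table is a mean topic proportion: the topic's proportions are summed over all poems of a period and divided by the number of poems in that period. The same calculation on a toy doc-topic matrix (all values invented for illustration) could look like this:

```r
# Toy doc-topic matrix: 4 poems x 3 topics, each row sums to 1 (invented values)
theta <- matrix(
  c(0.50, 0.25, 0.25,
    0.25, 0.50, 0.25,
    0.75, 0.00, 0.25,
    0.25, 0.25, 0.50),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("V1", "V2", "V3"))
)
period <- c("silver", "silver", "soviet", "soviet")

# Mean topic proportion per period: sum over poems / number of poems.
# rowsum() and table() both order the groups alphabetically, so rows line up.
shares <- rowsum(theta, group = period) / as.vector(table(period))
```

Because every row of the doc-topic matrix sums to 1, the shares of each period also sum to 1, so the two columns stay comparable even though the corpora differ in size.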

The funny pair of “labor” & “leisure” shows one of the important distinctions: labor as an activity was a concrete Soviet narrative, cultivated through poetry as well as through other media (note that the share of “leisure” did not change). Another Soviet narrative is “being a good person”: of course, earlier poets also thought about personal growth and its difficulties, but Soviet poets created a distinct genre out of such moralizing. The last Soviet topic on this list is “state & its metaphors”: the state was part of public discourse before the Revolution as well, but in Soviet times this discourse took on an extraordinary scale.

Unfortunately, the topics most typical of the Silver Age remained incomprehensible. Without any doubt, only the “peasant life” topic is more closely related to this period. That was less expected than the imbalance in “warfare”, “religion & mysticism”, and “nature & its beauty”.

Also, it is funny to see the strange “horses” topic balanced between the groups.

trydf <- data.frame(`Topic`, `Silver Age ratio`, `Soviet times ratio`)

trydf$color1 <- "silver"
trydf$color2 <- "soviet"
trydf %>%
  ggplot() +
  geom_col(aes(Topic, Silver.Age.ratio, fill = color1)) +
  geom_col(aes(Topic, -Soviet.times.ratio, fill = color2)) +
  coord_flip() +
  theme_bw() +
  labs(title = "Topics' distributions among the corpora,",
       subtitle = "with joined categories for 'nature & its beauty' and 'incomprehensible'",
       x = "", y = "", fill = "period") +
  scale_fill_manual(values = c("#9ECE9A", "#E74D4D"))

This graph simply illustrates the previous table.

Conclusion

After all this, I want to summarize the main ideas & questions to think about:

The data determines all further work. It should be uniform & representative. Relying on some web source is not a good idea.
It seems that lemmatization is not necessary for topic modeling; filtering stop words is enough.
How should short and long poems be treated? Removing the short ones is probably not a serious mistake, while the long ones might be divided into several texts.
For poetic topic modeling based on data with more than 5,000 observations, more than 100 LDA topics might be extracted.
In this paper, the small number of topics resulted in a partly meaningful list of themes (with 2 out of 15 topics incomprehensible). Getting unreadable topics is fine; they can simply be removed before further analysis.
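
For the last point, a minimal sketch of what removing unreadable topics could look like: drop their columns from the doc-topic matrix and rescale each row so that the remaining proportions sum to 1 again. The function name `drop_topics` and the toy matrix are hypothetical; in this report the candidates would be topics 5 and 9.

```r
# Hypothetical helper: remove noise topics and renormalise the remaining ones.
drop_topics <- function(theta, noise) {
  kept <- theta[, -noise, drop = FALSE]
  kept / rowSums(kept)   # each row is divided by its own remaining total
}

# Toy doc-topic matrix (invented values), topic 2 treated as noise
theta <- matrix(c(0.5, 0.25, 0.25,
                  0.2, 0.60, 0.20),
                nrow = 2, byrow = TRUE)
cleaned <- drop_topics(theta, noise = 2)
```

A poem that loads almost entirely on a removed topic keeps very little signal after rescaling, so such poems may deserve a separate look.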

Thanks!

3 Lab

Arthur Pecherskikh

14 12 2020