The film director Andrey Tarkovsky several times used his father's poems in place of music in his films. One of these moments, in "Zerkalo", was the first time I heard Arseny's poetry. Some Internet research a few years ago showed me that Arseny Tarkovsky (Andrey's father) is usually described as "the last poet of the Silver Age". This quote from another writer, Julius Halfin, has become so popular that it appears in virtually every note on Tarkovsky's work.
I was astonished to hear that: the Silver Age of Russian poetry spans roughly the end of the 19th and the beginning of the 20th century, while Tarkovsky was born in 1907 and died in 1989. That is quite a gap! So I want to find out whether this thesis holds up in terms of the methods we studied in the first module.
library(readr)
library(dplyr)
library(stringr)
library(tidytext)
library(stopwords)
library(ggplot2)
library(tidyverse)
library(tidylo)
library(cowplot)
For this work I parsed (1) all of Tarkovsky's poems available at culture.ru and (2) a selection of 500 poems of the Silver Age from slova.org.ru (in fact I got slightly fewer because of some mistakes with tags). There are many such selections; this one was chosen simply because it contains far more poems than the alternatives. An obvious note: since it covers many different authors, my assumption is that it is representative enough to contrast with Tarkovsky's poetry.
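The scraping code itself is not shown in this document; below is only a minimal sketch of the kind of loop that could produce poetry.csv, assuming hypothetical page URLs and a hypothetical CSS selector (the real pages on culture.ru use different markup).
library(rvest)
library(purrr)
# hypothetical URLs and selector, for illustration only
poem_urls <- c("https://www.culture.ru/poems/0001", "https://www.culture.ru/poems/0002")
poems <- map_chr(poem_urls, function(url) {
  read_html(url) %>% html_element(".poem-text") %>% html_text2()
})
readr::write_csv(data.frame(poetry = poems), "poetry.csv")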
tarko_poems <- read_csv("poetry.csv") %>% select(poetry)
silver_poems <- read.delim("table_silver.txt")
The numbers of poems in each set are:
# assumed construction of the summary vector (the original chunk is not shown)
numbers <- c("Silver Age poems" = nrow(silver_poems), "Arseny Tarkovsky poems" = nrow(tarko_poems))
numbers
## Silver Age poems Arseny Tarkovsky poems
## 496 160
I decided to perform lemmatization, as the data set is not large and inflected word forms would otherwise fragment the frequencies considerably.
# mystem flags: -l output lemmas only, -d apply contextual disambiguation,
# -e cp1251 set the encoding, -c copy the rest of the input to the output
tarko_poems$lem <- system2("mystem", c("-d", "-l", "-e cp1251", "-c"), input = tarko_poems$poetry, stdout = TRUE)
frame_silver <- system2("mystem", c("-d", "-l", "-e cp1251", "-c"), input = silver_poems$Var1, stdout = TRUE)
silver_lem <- as.data.frame(frame_silver)
Some more preparations and tokenization:
# drop the empty rows left by mystem and coerce back to a character data frame
silver_lem <- silver_lem[!apply(silver_lem == "", 1, all),]
silver_lem <- as.data.frame(silver_lem)
silver_lem$silver_lem <- as.character(silver_lem$silver_lem)
# tokenize: one lemma per row
silver <- silver_lem %>% unnest_tokens(lem, silver_lem)
tarko_poems$lem <- as.character(tarko_poems$lem)
tarko <- tarko_poems %>% unnest_tokens(lem, lem) %>% select(lem)
Removing stopwords, punctuation, and Latin characters (the last one was handled in a rather clumsy way):
tarko <- tarko %>%
  filter(!str_detect(lem, "[[:punct:]]|[[:digit:]]"))
rustopwords <- data.frame(words = stopwords("ru"), stringsAsFactors = FALSE)
tarko <- filter(tarko, !(lem %in% stopwords("ru")))
# drop stray Roman numerals and the tokens where a numeral got glued to the first word
# of a stanza, then put the word parts back
tarko <- tarko %>%
  filter(!lem %in% c("b", "i", "ii", "iii", "iv", "v", "iжили", "iiiiгде", "iiна", "ivземля")) %>%
  bind_rows(data.frame(lem = c("жили", "где", "на", "земля")))
silver <- silver %>%
  filter(!str_detect(lem, "[[:punct:]]|[[:digit:]]|[a-z]"))  # [a-z] replaces the hand-typed Latin alphabet
silver <- filter(silver, !(lem %in% stopwords("ru")))
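A quick sanity check (not part of the original pipeline) to confirm that no Latin letters or digits survived the cleaning:
# number of remaining tokens that still contain Latin letters or digits
sum(str_detect(tarko$lem, "[a-z]|[[:digit:]]"))
sum(str_detect(silver$lem, "[a-z]|[[:digit:]]"))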
Finally, I constructed the frequency lists for each group:
freq_tarko <- tarko %>% count(lem)
freq_silver <- silver %>% count(lem)
As a starting point of the investigation, I plotted Zipf's curve and a bar chart for each set. It is interesting and even touching that some words (like "душа") appear in both top lists. What is even more touching, Arseny's top words can be read as related to the imagery of his son's films.
As a brief note, the second curve is smoother simply because of the larger data size (although it still gets ragged in its tail).
tarko %>%
dplyr::count(lem, sort = TRUE) %>%
dplyr::mutate(rank = row_number()) %>%
ggplot(aes(rank, n)) +
geom_line() +
scale_x_log10() +
scale_y_log10() + theme_classic() + ggtitle("Arseny Tarkovsky poetry & Zipf's curve")
silver %>%
dplyr::count(lem, sort = TRUE) %>%
dplyr::mutate(rank = row_number()) %>%
ggplot(aes(rank, n)) +
geom_line() +
scale_x_log10() +
scale_y_log10() + theme_classic() + ggtitle("Silver age poets & Zipf's curve")
tarko %>%
dplyr::count(lem) %>%
top_n(15, n) %>%
ggplot(aes(x = reorder(lem, n), y = n)) +
geom_col() +
labs(x = "word", y = "number of") +
coord_flip() +
theme_classic() + ggtitle("Arseny Tarkovsky's words frequency list (top15)")
silver %>%
dplyr::count(lem) %>%
top_n(15, n) %>%
ggplot(aes(x = reorder(lem, n), y = n)) +
geom_col() +
labs(x = "word", y = "number of") +
coord_flip() +
theme_classic() + ggtitle("Silver age poets' words frequency list (top15)")
Before the analysis, I joined the data, replaced NAs with zeros, and created several variables: (1) tarko and silver hold the original counts with zeros instead of NAs, (2) tarko1 and silver1 hold the counts after Laplace (add-one) smoothing, and (3) ntarko and nsilver hold each word's share between the two corpora, multiplied by 1000.
freq_silver <- freq_silver %>% rename(silver = n)
freq_tarko <- freq_tarko %>% rename(tarko = n)
allfreq <- freq_tarko %>% full_join(freq_silver, by = "lem")
allfreq$tarko[is.na(allfreq$tarko)] <- 0
allfreq$silver[is.na(allfreq$silver)] <- 0
allfreq$tarko1 <- allfreq$tarko
allfreq$silver1 <- allfreq$silver
allfreq$tarko1 <- allfreq$tarko1 + 1
allfreq$silver1 <- allfreq$silver1 + 1
# each word's share between the two smoothed counts, scaled to 1000
allfreq$ntarko <- allfreq$tarko1/(allfreq$tarko1+allfreq$silver1)*1000
allfreq$nsilver <- allfreq$silver1/(allfreq$tarko1+allfreq$silver1)*1000
The first measure to apply is the "simple maths" introduced by Adam Kilgarriff.
allfreq$sm <- allfreq$ntarko/allfreq$nsilver
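Note that because ntarko and nsilver are per-word shares, sm reduces to the smoothed count ratio tarko1/silver1. Kilgarriff's original "simple maths" instead normalizes each count by its corpus size (typically per million words) before adding a smoothing constant; a sketch of that variant for comparison (the column names fpm_tarko, fpm_silver, and sm_pm are my own):
# frequencies per million words in each corpus, then the smoothed ratio
allfreq$fpm_tarko <- allfreq$tarko / sum(allfreq$tarko) * 1e6
allfreq$fpm_silver <- allfreq$silver / sum(allfreq$silver) * 1e6
allfreq$sm_pm <- (allfreq$fpm_tarko + 1) / (allfreq$fpm_silver + 1)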
To inspect the result, I first tried simply sorting by the sm values. Naturally, the top was occupied by values well above 1 - words like "телефон" (sm = 8) or "криница" (sm = 9) that were unfamiliar to the poets of the Silver Age or at least not widely used by them.
smlist <- allfreq[order(allfreq$sm, decreasing = TRUE),] %>% top_n(15, sm) %>% select("lem", "ntarko", "nsilver", "sm")
head(smlist, 10)
## # A tibble: 10 x 4
## lem ntarko nsilver sm
## <chr> <dbl> <dbl> <dbl>
## 1 криница 900 100 9
## 2 телефон 889. 111. 8
## 3 глотнуть 875 125 7
## 4 тетрадь 875 125 7
## 5 глотать 857. 143. 6
## 6 обжигать 857. 143. 6
## 7 продувать 857. 143. 6
## 8 сцена 857. 143. 6
## 9 германн 833. 167. 5.
## 10 подставлять 833. 167. 5.
Because of that, I then looked at the data sorted by the raw tarko counts:
smlist <- allfreq[order(allfreq$tarko, decreasing = TRUE),] %>% top_n(15, tarko) %>% select("lem", "ntarko", "nsilver", "sm")
head(smlist, 10)
## # A tibble: 10 x 4
## lem ntarko nsilver sm
## <chr> <dbl> <dbl> <dbl>
## 1 твой 314. 686. 0.458
## 2 свой 298. 702. 0.424
## 3 земля 383. 617. 0.622
## 4 рука 255. 745. 0.342
## 5 свет 300 700 0.429
## 6 весь 119. 881. 0.135
## 7 душа 192. 808. 0.238
## 8 это 191. 809. 0.236
## 9 дом 274. 726. 0.378
## 10 белый 277. 723. 0.383
Almost all of these words appear in both sets, so the values mostly sit around 0.4. The second set is larger, so this is to be expected.
Further, I calculated the log-likelihood coefficient. For this method I used the variables tarko1 and silver1 - the Laplace-smoothed but not normalized counts (using the normalized variables produced expected values e that were all very similar, so that attempt failed).
# expected counts under the null hypothesis that a word is equally typical of both corpora
e_tarko = (sum(allfreq$tarko1)/(sum(allfreq$tarko1)+sum(allfreq$silver1)))*(allfreq$tarko1+allfreq$silver1)
e_silver = (sum(allfreq$silver1)/(sum(allfreq$tarko1)+sum(allfreq$silver1)))*(allfreq$tarko1+allfreq$silver1)
# log-likelihood: 2 * sum of observed * log(observed / expected) over the two corpora
allfreq$LL = 2*(allfreq$tarko1*log(allfreq$tarko1/e_tarko) +
allfreq$silver1*log(allfreq$silver1/e_silver))
LList <- allfreq[order(allfreq$LL, decreasing = TRUE),] %>% top_n(15, LL) %>% select("lem", "tarko1", "silver1", "LL")
head(LList, 10)
## # A tibble: 10 x 4
## lem tarko1 silver1 LL
## <chr> <dbl> <dbl> <dbl>
## 1 весь 41 303 51.7
## 2 сердце 20 198 47.0
## 3 любовь 6 121 46.2
## 4 лишь 2 86 42.3
## 5 милый 3 86 37.7
## 6 солнце 4 88 34.9
## 7 любить 15 145 33.6
## 8 знать 25 182 30.4
## 9 ль 1 55 28.5
## 10 б 8 93 25.4
This top is sorted by the LL values. It surprised me a bit with the words "ль" and "б" - the stopword list I used does not contain them. As these words carry a sort of special aura (at least "ль" does), I have not removed them.
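If one did want to exclude such particles, the stopword list could simply be extended before counting; a small sketch (the object names are hypothetical, and this step is not applied in the rest of the document):
# extra particles missing from stopwords("ru")
extra_stops <- c("ль", "б")
tarko_nostop <- filter(tarko, !(lem %in% extra_stops))
silver_nostop <- filter(silver, !(lem %in% extra_stops))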
Calculating the PMI:
# PMI = log2( p(word | corpus) / p(word) ); the log2 must cover the whole ratio
allfreq$tarko_pmi = log2(((allfreq$tarko)/sum(allfreq$tarko))/((allfreq$tarko + allfreq$silver)/(sum(allfreq$tarko) + sum(allfreq$silver))))
allfreq$silver_pmi = log2(((allfreq$silver)/sum(allfreq$silver))/((allfreq$tarko + allfreq$silver)/(sum(allfreq$tarko) + sum(allfreq$silver))))
pmiList1 <- allfreq[order(allfreq$tarko_pmi, decreasing = TRUE),] %>% top_n(15, tarko_pmi) %>% select("lem", "tarko", "silver", "tarko_pmi", "silver_pmi")
pmiList2 <- allfreq[order(allfreq$silver_pmi, decreasing = TRUE),] %>% top_n(15, silver_pmi) %>% select("lem", "tarko", "silver", "tarko_pmi", "silver_pmi")
head(pmiList1, 10)
## # A tibble: 10 x 5
## lem tarko silver tarko_pmi silver_pmi
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 твой 96 211 -1140. -1330.
## 2 весь 40 302 -1220. -1113.
## 3 свой 74 176 -1480. -1688.
## 4 рука 52 154 -1927. -2098.
## 5 душа 38 163 -2095. -2129.
## 6 земля 68 110 -2115. -2573.
## 7 это 37 160 -2148. -2180.
## 8 глаз 32 167 -2182. -2141.
## 9 сердце 19 197 -2195. -1914.
## 10 ночь 34 161 -2203. -2199.
head(pmiList2, 10)
## # A tibble: 10 x 5
## lem tarko silver tarko_pmi silver_pmi
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 весь 40 302 -1220. -1113.
## 2 твой 96 211 -1140. -1330.
## 3 свой 74 176 -1480. -1688.
## 4 сердце 19 197 -2195. -1914.
## 5 знать 24 181 -2225. -2048.
## 6 рука 52 154 -1927. -2098.
## 7 душа 38 163 -2095. -2129.
## 8 глаз 32 167 -2182. -2141.
## 9 это 37 160 -2148. -2180.
## 10 ночь 34 161 -2203. -2199.
The first list is sorted by tarko_pmi and the second one by silver_pmi.
The next measure is the (smoothed) log odds ratio:
# odds of a word in each corpus, computed from the Laplace-smoothed counts
o1 = (allfreq$tarko1/(sum(allfreq$tarko1) - allfreq$tarko1))
o2 = (allfreq$silver1/(sum(allfreq$silver1) - allfreq$silver1))
allfreq$LO = log(o1/o2)
LO <- allfreq[order(allfreq$LO, decreasing = TRUE),] %>% top_n(15, LO) %>% select("lem", "tarko1", "silver1", "LO")
head(LO, 10)
## # A tibble: 10 x 4
## lem tarko1 silver1 LO
## <chr> <dbl> <dbl> <dbl>
## 1 криница 9 1 3.15
## 2 телефон 8 1 3.03
## 3 глотнуть 7 1 2.89
## 4 тетрадь 7 1 2.89
## 5 глотать 6 1 2.74
## 6 обжигать 6 1 2.74
## 7 продувать 6 1 2.74
## 8 сцена 6 1 2.74
## 9 словарь 10 2 2.56
## 10 германн 5 1 2.56
As with Kilgarriff's simple maths, the result is heavily affected by the words that are relatively popular in Tarkovsky's poetry while being practically absent from the Silver Age set.
To calculate the weighted log odds, I first needed to create a new table with the words, their frequencies, and their "mother" corpus.
fc <- tarko %>% count(lem) %>% mutate(corpus = "fc")
rc <- silver %>% count(lem) %>% mutate(corpus = "rc")
table <- rbind(fc, rc)
# tidylo adds the weighted log odds of each word within each corpus (Monroe et al.'s method)
table <- table %>%
bind_log_odds(corpus, lem, n)
WLO <- table[order(table$log_odds_weighted, decreasing = TRUE),]
head(WLO, 10)
## # A tibble: 10 x 4
## lem n corpus log_odds_weighted
## <chr> <int> <chr> <dbl>
## 1 твой 96 fc 5.43
## 2 земля 68 fc 4.87
## 3 свой 74 fc 4.68
## 4 трава 30 fc 3.73
## 5 рука 52 fc 3.73
## 6 живой 34 fc 3.56
## 7 свет 41 fc 3.48
## 8 вода 35 fc 3.45
## 9 время 30 fc 3.30
## 10 криница 8 fc 3.25
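The head above shows only the Tarkovsky ("fc") side; the same table also holds the Silver Age side, which can be inspected in exactly the same way:
# top weighted log odds words for the Silver Age ("rc") corpus
WLO %>% filter(corpus == "rc") %>% head(10)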
Setting aside much of my limited confidence in my own computations, the results from the different measures partly coincide. The overall list of words typical of Arseny Tarkovsky's poetry should definitely include both the most frequent words (those shown in the bar chart at the beginning of the document) and the words that are frequent enough not to be accidental (like "криница", which occurred 8 times even without Laplace!). This was probably obvious before the work, but now it seems more evident to me: back in September, for example, I tended to think "the top 10 or top 20 is representative enough". Such lists are representative, with some limitations, but it is much better to consider all the words that are relatively more frequent.
A final remark is about filtering. It can be really helpful before looking at the results of any of these measures (I only tried it with the "simple maths" ranking, as sketched below). It helps to focus on the core vocabulary.
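For instance, a minimum-frequency filter applied before ranking by sm might look like this (the threshold of 5 is arbitrary and my own):
allfreq %>%
  filter(tarko >= 5, silver >= 5) %>%  # keep only words attested at least 5 times in each corpus
  arrange(desc(sm)) %>%
  select(lem, tarko, silver, sm) %>%
  head(15)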