An Exploration of Similar Works

In this report, I examine two works in different mediums that cover the same topic: the epic novel “War and Peace” by Leo Tolstoy and the opera “Natasha, Pierre and the Great Comet of 1812”, which is based on “War and Peace”. The opera premiered Off-Broadway in 2012 and opened on Broadway in 2016. I compare the sentiments of the two works to see how similar or dissimilar they are, using lists of the most common positive and negative words, bi-grams, and word clouds. Throughout this project I use functions such as anti_join(), arrange(), and group_by(), and I visualize my findings with ggplot() and wordcloud2().

Disclaimer: in this report I will analyze the positive and negative sentiment of words in these works. There are several “bad” words included in the negative sentiments, so please be aware as you read.

War and Peace

To begin, I went to the Project Gutenberg website, which houses a huge collection of books that are out of copyright, and downloaded the .txt file for War and Peace. Below, I import the file, rename the column that holds the text, and split the text into individual words.

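library(tidyverse)  # read_csv(), %>%, dplyr verbs, ggplot()
library(tidytext)   # unnest_tokens(), get_sentiments()

# Import the Project Gutenberg text and name its first column "text"
war_peace <- read_csv("~/Desktop/GreatComet/war&peace.txt")
colnames(war_peace)[1] <- "text"

# Split the text into one lowercase word per row
peaceWords <- war_peace %>%
  unnest_tokens(word, text)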

War and Peace Word Sentiments

Sentiment analysis can be a powerful tool for seeing how positive or negative a work is. I used get_sentiments() with the “bing” and “afinn” lexicons. “Bing” is a pre-established lexicon that sorts each word it contains into one of two categories, positive or negative. Below, I count the positive and negative words in the novel, excluding stop words. “Afinn” is also a lexicon, but it assigns each word a score from -5 to 5: the sign shows whether the word is negative or positive, and the size of the score shows how strongly.

## # A tibble: 2 × 2
## # Groups:   sentiment [2]
##   sentiment     n
##   <chr>     <int>
## 1 negative  16549
## 2 positive  12851
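
For readers who want to reproduce counts like the ones above, here is a minimal sketch of the counting step. It is illustrative rather than the exact code behind this report: the two-line `toy` text and the name `toy_counts` are assumptions made up for the example, and the real analysis would start from the tokenized novel instead.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the novel's text
toy <- tibble(text = c("The happy prince loved the brave soldier",
                       "The cruel war brought death and fear"))

toy_counts <- toy %>%
  unnest_tokens(word, text) %>%                        # one lowercase word per row
  anti_join(stop_words, by = "word") %>%               # drop stop words
  inner_join(get_sentiments("bing"), by = "word") %>%  # keep words found in the lexicon
  group_by(sentiment) %>%
  count()                                              # words per sentiment category

print(toy_counts)
```

The “bing” lexicon ships with tidytext, so this runs without extra downloads. Swapping in get_sentiments("afinn") should work the same way, but that lexicon is fetched through the textdata package the first time it is used.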

War and Peace N-Grams

N-grams can be used to examine how often a set of words appears together. Here, I focus on two main characters, “Natasha” and “Pierre”, as they are the most likely to appear in both works. I use bi-grams rather than longer n-grams because the longer and more specific an n-gram gets, the harder it is to analyze. The four tables below show, in order, the words that most often follow “natásha”, the words that most often precede “natásha”, and the same two lists for “pierre”.

## # A tibble: 181 × 3
##    word1   word2        n
##    <chr>   <chr>    <int>
##  1 natásha looked      17
##  2 natásha sat         10
##  3 natásha suddenly    10
##  4 natásha ran          7
##  5 natásha natásha      6
##  6 natásha blushed      5
##  7 natásha replied      5
##  8 natásha rostóva      5
##  9 natásha smiled       5
## 10 natásha answered     4
## # … with 171 more rows
## # A tibble: 221 × 3
##    word1      word2       n
##    <chr>      <chr>   <int>
##  1 cried      natásha    17
##  2 exclaimed  natásha     7
##  3 natásha    natásha     6
##  4 replied    natásha     6
##  5 called     natásha     4
##  6 door       natásha     4
##  7 pierre     natásha     4
##  8 told       natásha     4
##  9 whispered  natásha     4
## 10 addressing natásha     3
## # … with 211 more rows
## # A tibble: 236 × 3
##    word1  word2         n
##    <chr>  <chr>     <int>
##  1 pierre looked       29
##  2 pierre replied      12
##  3 pierre noticed      10
##  4 pierre sat          10
##  5 pierre continued     8
##  6 pierre left          8
##  7 pierre wished        8
##  8 pierre glanced       7
##  9 pierre listened      7
## 10 pierre stood         7
## # … with 226 more rows
## # A tibble: 382 × 3
##    word1    word2      n
##    <chr>    <chr>  <int>
##  1 monsieur pierre    16
##  2 replied  pierre    11
##  3 answered pierre     8
##  4 moment   pierre     8
##  5 moscow   pierre     8
##  6 struck   pierre     7
##  7 told     pierre     7
##  8 day      pierre     5
##  9 home     pierre     5
## 10 question pierre     5
## # … with 372 more rows
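
Tables like the ones above can be built along the following lines. This is a sketch under assumptions, not the report's original code: the `toy` text and the names `bigrams` and `natasha_first` are invented for illustration, and I assume stop words are removed from both positions of each bi-gram.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical stand-in for the novel's text (\u00e1 is the accented a in the name)
toy <- tibble(text = c("Nat\u00e1sha looked at Pierre and Nat\u00e1sha looked away",
                       "Then Nat\u00e1sha sat down and Pierre looked at her"))

bigrams <- toy %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%       # overlapping word pairs
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%    # one column per word
  filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word)

# Words that follow the name; filter on word2 instead to get the words that precede it
natasha_first <- bigrams %>%
  filter(word1 == "nat\u00e1sha") %>%
  count(word1, word2, sort = TRUE)

print(natasha_first)
```

Note that unnest_tokens() lowercases the text, which is why the filter uses the lowercase, accented spelling of the name.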

War and Peace Wordcloud

War and Peace is a very long work with over 566,000 words. So, when I created this word cloud, I excluded character names and stop words and only included the top 300 words for ease of viewing and understanding.
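
A rough sketch of the filtering behind a word cloud like this one is shown below. The `toy` text, the `character_names` list, and the name `top_words` are assumptions for the example only. The real list of excluded names would be much longer.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-ins for the text and the list of character names
toy <- tibble(text = c("Natasha wrote of love and war and love",
                       "Pierre thought of war and peace and love"))
character_names <- tibble(word = c("natasha", "pierre"))

top_words <- toy %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%        # drop stop words
  anti_join(character_names, by = "word") %>%   # drop character names
  count(word, sort = TRUE) %>%
  slice_head(n = 300)                           # keep at most the 300 most common words

# Draw the cloud only if wordcloud2 is installed; it reads the first two
# columns of the data frame as the word and its frequency
if (requireNamespace("wordcloud2", quietly = TRUE)) {
  cloud <- wordcloud2::wordcloud2(as.data.frame(top_words))
}
```

Capping the list with slice_head() keeps the cloud readable, since rare words would otherwise crowd out the common ones.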

Natasha, Pierre and the Great Comet of 1812

At the time of this project, R’s interface to the Genius API was not working. Instead, I manually copied the lyrics for the opera from the website AZLyrics.com into an Excel file, which I then imported into R.

library(readxl)  # read_excel()

# Import the lyrics from the Excel file
comet <- read_excel("~/Desktop/GreatComet/natasha, pierre and the great comet of 1812.xlsx")

# Split the lyrics column into one lowercase word per row
cometWords <- comet %>%
  unnest_tokens(word, lyrics)

Natasha, Pierre and the Great Comet of 1812 Sentiments

I again used get_sentiments() with the “bing” and “afinn” lexicons, partly because they are useful tools and partly to keep the analysis consistent throughout the report. Below you can see the number of positive and negative words in the opera, excluding stop words.

## # A tibble: 2 × 2
## # Groups:   sentiment [2]
##   sentiment     n
##   <chr>     <int>
## 1 negative    378
## 2 positive    337
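
Because the two works differ so much in length, raw counts are hard to compare directly. One way to put them on the same scale is to convert each count into a share of that work's sentiment words. The sketch below does this using only the four counts reported in this report. The table layout and the names `sentiment_counts` and `sentiment_shares` are my own choices for illustration.

```r
library(dplyr)

# Counts copied from the two sentiment tables in this report
sentiment_counts <- tibble(
  work      = c("novel", "novel", "opera", "opera"),
  sentiment = c("negative", "positive", "negative", "positive"),
  n         = c(16549, 12851, 378, 337)
)

# Share of each sentiment within each work
sentiment_shares <- sentiment_counts %>%
  group_by(work) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

print(sentiment_shares)
```

By this measure, roughly 56% of the novel's sentiment words are negative, compared with roughly 53% of the opera's.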

Natasha, Pierre and the Great Comet of 1812 N-Grams

For consistency, I again built bi-grams around the names “natasha” and “pierre”. You will notice there are far fewer bi-grams to analyze here than in the novel, as the opera is just over 10,000 words.

## # A tibble: 4 × 3
##   word1   word2       n
##   <chr>   <chr>   <int>
## 1 natasha smooth      2
## 2 natasha anatole     1
## 3 natasha cried       1
## 4 natasha natasha     1
## # A tibble: 5 × 3
##   word1    word2       n
##   <chr>    <chr>   <int>
## 1 favorite natasha     2
## 2 dear     natasha     1
## 3 natasha  natasha     1
## 4 tone     natasha     1
## 5 true     natasha     1
## # A tibble: 5 × 3
##   word1  word2        n
##   <chr>  <chr>    <int>
## 1 pierre hold         2
## 2 pierre bezukhov     1
## 3 pierre closed       1
## 4 pierre sniffed      1
## 5 pierre stand        1
## # A tibble: 5 × 3
##   word1   word2      n
##   <chr>   <chr>  <int>
## 1 ah      pierre     1
## 2 awkward pierre     1
## 3 dear    pierre     1
## 4 evening pierre     1
## 5 married pierre     1

Natasha, Pierre and the Great Comet of 1812 Wordcloud

As with the novel, I excluded character names and stop words from this word cloud and included only the top 300 words for ease of viewing and understanding. This also keeps the two word clouds consistent for the final analysis.

Conclusions

These findings have led me to a few main conclusions.
1. On the surface, both works are more negative than positive, and the novel skews more negative than the opera (16,549 negative to 12,851 positive words, versus 378 to 337). Not only is the novel considerably longer, but the opera was also designed for Broadway, where making the viewer feel too “heavy” or sad upon leaving is not good for business.
2. For the top 15 most positive and negative words, it is hard to draw any conclusions, except that there is some overlap in the words “torture”, “hell”, “damn”, “hurrah”, “fun”, and “amazing”. The overlapping words may appear in a different order simply because the works differ so much in length.
3. As for the n-grams, there is virtually no overlap. This could be due to the words chosen for the n-grams or simply to the lengths of the works. In the same vein, subtle differences in spelling between the novel and the opera could also be to blame for this lack of results, e.g. Natásha vs Natasha or Natalie vs Nataly.
In summary, both works cover topics like war and the turmoil that it brings, so it is no surprise that the overall sentiments are negative. However, looking at the most common words in the word clouds, one may be left with a different impression. Initially I expected more overlap in the sentiment analysis: I figured more of the bi-grams and top words would be similar or in a similar order. But because the works are in such different mediums, it is not surprising that they differ.
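
The spelling issue raised in conclusion 3 could be reduced by stripping accents before matching names across the two works. The sketch below shows one possible approach using the stringi package. It was not part of the original analysis, and the helper name `normalize_name` and the example vectors are hypothetical.

```r
library(stringi)

# Hypothetical helper: lowercase a name and strip its accents
normalize_name <- function(x) {
  stri_trans_general(stri_trans_tolower(x), "Latin-ASCII")
}

# Spellings as they appear in each work (\u00e1 and \u00f3 are the accented a and o)
novel_names <- c("Nat\u00e1sha", "Rost\u00f3va")
opera_names <- c("Natasha", "Rostova")

normalized_novel <- normalize_name(novel_names)
normalized_opera <- normalize_name(opera_names)

print(normalized_novel)
print(normalized_opera)
```

This only handles accents. Variant spellings such as Natalie vs Nataly would still need a manual lookup table.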