In this report, I examine two works in different mediums that cover the same story: the epic novel “War and Peace” by Leo Tolstoy and the opera “Natasha, Pierre and the Great Comet of 1812”, which is based on “War and Peace”. The opera premiered in 2012 and opened on Broadway in 2016. I want to compare the sentiments of the two works and see how similar or dissimilar they are, using lists of the most common positive and negative words, bi-grams, and word clouds. Throughout this project I use functions such as anti_join(), arrange(), and group_by(), along with ggplot() and wordcloud2() to visualize my findings.
Disclaimer: in this report I will analyze the positive and negative sentiment of words in these works. There are several “bad” words included in the negative sentiments, so please be aware as you read.
To begin, I went to the Project Gutenberg website, which houses a huge collection of books that are out of copyright, and downloaded the .txt file for War and Peace. Here, I import the file, rename the column that holds the text, and split the text into individual words.
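library(readr)     # read_csv()
library(dplyr)     # %>% and the data verbs used below
library(tidytext)  # unnest_tokens(), get_sentiments()
# Import the downloaded text and name its first column "text"
war_peace <- read_csv("~/Desktop/GreatComet/war&peace.txt")
colnames(war_peace)[1] <- "text"
# Split the text into one lowercase word per row
peaceWords <- war_peace %>% unnest_tokens(word, text)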
Sentiment analysis can be a powerful tool for seeing how positive or negative a work is. I used two lexicons available through get_sentiments(): “bing” and “afinn”. “Bing” draws on a pre-established lexicon and sorts each word into one of two categories, positive or negative. Below, I count the negative and positive words in the novel, excluding stop words. “Afinn” also draws on a lexicon but assigns each word a score from -5 to 5: the lower the score, the more negative the word, and the higher the score, the more positive.
## # A tibble: 2 × 2
## # Groups: sentiment [2]
## sentiment n
## <chr> <int>
## 1 negative 16549
## 2 positive 12851
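As a rough illustration, counts like the ones above could be produced along the following lines. This is a minimal sketch, not the exact code behind this report: the small peaceWords tibble is a made-up stand-in for the tokenized novel, and the sketch assumes the dplyr and tidytext packages are installed.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the tokenized novel (one word per row)
peaceWords <- tibble(word = c("the", "happy", "prince", "was", "afraid",
                              "of", "death", "and", "love", "fear"))

sentiment_counts <- peaceWords %>%
  anti_join(stop_words, by = "word") %>%               # drop stop words
  inner_join(get_sentiments("bing"), by = "word") %>%  # keep words found in the Bing lexicon
  group_by(sentiment) %>%
  count() %>%                                          # words per sentiment category
  arrange(desc(n))

sentiment_counts
```

Words that do not appear in the Bing lexicon (here, "prince") drop out at the inner_join() step, so the totals only cover sentiment-bearing words.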
N-grams can be used to examine how often a set of words appears together. Here, I chose to focus on the main characters “Natasha” and “Pierre”, as they are the most likely to appear across both works. I chose bi-grams over longer n-grams because the longer and more specific an n-gram gets, the more difficult it is to analyze. For each name, the first table lists the bi-grams in which the name comes first, and the second lists those in which it comes second.
## # A tibble: 181 × 3
## word1 word2 n
## <chr> <chr> <int>
## 1 natásha looked 17
## 2 natásha sat 10
## 3 natásha suddenly 10
## 4 natásha ran 7
## 5 natásha natásha 6
## 6 natásha blushed 5
## 7 natásha replied 5
## 8 natásha rostóva 5
## 9 natásha smiled 5
## 10 natásha answered 4
## # … with 171 more rows
## # A tibble: 221 × 3
## word1 word2 n
## <chr> <chr> <int>
## 1 cried natásha 17
## 2 exclaimed natásha 7
## 3 natásha natásha 6
## 4 replied natásha 6
## 5 called natásha 4
## 6 door natásha 4
## 7 pierre natásha 4
## 8 told natásha 4
## 9 whispered natásha 4
## 10 addressing natásha 3
## # … with 211 more rows
## # A tibble: 236 × 3
## word1 word2 n
## <chr> <chr> <int>
## 1 pierre looked 29
## 2 pierre replied 12
## 3 pierre noticed 10
## 4 pierre sat 10
## 5 pierre continued 8
## 6 pierre left 8
## 7 pierre wished 8
## 8 pierre glanced 7
## 9 pierre listened 7
## 10 pierre stood 7
## # … with 226 more rows
## # A tibble: 382 × 3
## word1 word2 n
## <chr> <chr> <int>
## 1 monsieur pierre 16
## 2 replied pierre 11
## 3 answered pierre 8
## 4 moment pierre 8
## 5 moscow pierre 8
## 6 struck pierre 7
## 7 told pierre 7
## 8 day pierre 5
## 9 home pierre 5
## 10 question pierre 5
## # … with 372 more rows
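Bi-gram tables like the ones above can be built roughly as follows. This is a sketch under assumptions: the two-line war_peace tibble is invented for illustration, the variable names pierre_first and pierre_second are my own, and the sketch assumes dplyr, tidyr, and tidytext are installed.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical stand-in for the imported novel (one line of text per row)
war_peace <- tibble(text = c("Pierre looked at her and Pierre replied",
                             "Monsieur Pierre looked away, replied Pierre"))

bigrams <- war_peace %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%  # overlapping word pairs
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,                       # drop pairs containing a stop word
         !word2 %in% stop_words$word)

# Bi-grams in which the name comes first
pierre_first <- bigrams %>%
  filter(word1 == "pierre") %>%
  count(word1, word2, sort = TRUE)

# Bi-grams in which the name comes second
pierre_second <- bigrams %>%
  filter(word2 == "pierre") %>%
  count(word1, word2, sort = TRUE)

pierre_first
pierre_second
```

Because unnest_tokens() lowercases the text by default, the filter has to match the lowercase form of the name.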
War and Peace is a very long work, with over 566,000 words. So, for this word cloud, I excluded character names and stop words and included only the 300 most frequent words for ease of viewing and understanding.
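The filtering described here could look something like the sketch below. The word list and the character_names tibble are hypothetical stand-ins, and the sketch assumes the dplyr, tidytext, and wordcloud2 packages are installed.

```r
library(dplyr)
library(tidytext)
library(wordcloud2)

# Hypothetical stand-ins: a tokenized text and the character names to exclude
peaceWords <- tibble(word = c("the", "war", "war", "army", "pierre", "andrew",
                              "horse", "army", "war", "and"))
character_names <- tibble(word = c("pierre", "andrew"))

cloud_data <- peaceWords %>%
  anti_join(stop_words, by = "word") %>%       # drop stop words
  anti_join(character_names, by = "word") %>%  # drop character names
  count(word, sort = TRUE) %>%                 # most frequent words first
  slice_head(n = 300)                          # keep at most the top 300

# wordcloud2() reads the word from column 1 and its frequency from column 2
cloud <- wordcloud2(cloud_data)
cloud
```

Capping the list at 300 words keeps the cloud legible; on a text this long, an uncapped cloud would be dominated by thousands of tiny, unreadable words.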
At the time of creating this project, the Genius API was not working from R. Instead, I manually collected the lyrics for the opera from the website AZLyrics.com and put them into an Excel file that I imported into R.
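library(readxl)  # read_excel()
# Import the spreadsheet of lyrics
comet <- read_excel("~/Desktop/GreatComet/natasha, pierre and the great comet of 1812.xlsx")
# Tokenize the lyrics column into one token per row; the output column is named song
cometWords <- comet %>% unnest_tokens(song, lyrics)
colnames(cometWords)[1] <- "word"  # rename the first column to match peaceWords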
I again used the “bing” and “afinn” lexicons through get_sentiments(). This was in part because they are useful tools, and in part to keep the analysis consistent throughout the report. Below you can see the number of positive and negative words in the opera, excluding stop words.
## # A tibble: 2 × 2
## # Groups: sentiment [2]
## sentiment n
## <chr> <int>
## 1 negative 378
## 2 positive 337
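Because the two works differ so much in length, the raw counts are easier to compare as shares. The sketch below reuses the four counts reported in this report; the counts tibble and the column names are my own for illustration, and only dplyr is assumed.

```r
library(dplyr)

# Bing sentiment counts reported in this report (stop words excluded)
counts <- tibble(
  work      = c("novel", "novel", "opera", "opera"),
  sentiment = c("negative", "positive", "negative", "positive"),
  n         = c(16549, 12851, 378, 337)
)

shares <- counts %>%
  group_by(work) %>%
  mutate(share = n / sum(n)) %>%   # fraction of each work's sentiment-bearing words
  ungroup() %>%
  filter(sentiment == "negative")

shares
```

With these counts, the negative share comes out to about 0.563 for the novel and about 0.529 for the opera.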
Bi-grams using the names “natasha” and “pierre” were used again here for consistency. You will notice there are far fewer bi-grams to analyze than in the novel, as the opera is just over 10,000 words.
## # A tibble: 4 × 3
## word1 word2 n
## <chr> <chr> <int>
## 1 natasha smooth 2
## 2 natasha anatole 1
## 3 natasha cried 1
## 4 natasha natasha 1
## # A tibble: 5 × 3
## word1 word2 n
## <chr> <chr> <int>
## 1 favorite natasha 2
## 2 dear natasha 1
## 3 natasha natasha 1
## 4 tone natasha 1
## 5 true natasha 1
## # A tibble: 5 × 3
## word1 word2 n
## <chr> <chr> <int>
## 1 pierre hold 2
## 2 pierre bezukhov 1
## 3 pierre closed 1
## 4 pierre sniffed 1
## 5 pierre stand 1
## # A tibble: 5 × 3
## word1 word2 n
## <chr> <chr> <int>
## 1 ah pierre 1
## 2 awkward pierre 1
## 3 dear pierre 1
## 4 evening pierre 1
## 5 married pierre 1
When I created this word cloud, I again excluded character names and stop words and included only the top 300 words, both for ease of viewing and for consistency across the final analysis.
All of these findings have led me to a few main conclusions.
1. While on the surface both works are more negative than positive, the novel is more negative than the opera: by the Bing counts above, about 56% of the novel's sentiment-bearing words are negative, versus about 53% for the opera. Not only is the novel considerably longer, but the opera was also designed for Broadway, where making the viewer feel too “heavy” or sad upon leaving is not good for business.
2. For the top 15 most positive and negative words, it is hard to draw any conclusions, except that there is some overlap in the words “torture”, “hell”, “damn”, “hurrah”, “fun”, and “amazing”. The overlapping words may not appear in the same order because the works differ so much in length.
3. As far as the n-grams go, there is virtually no overlap. This could be due to the words chosen for the n-grams or simply the lengths of the works. In that same vein, the subtle differences in spelling between the novel and the opera could also be to blame for this lack of results, e.g. Natásha vs Natasha or Natalie vs Nataly.
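One possible way to reduce the spelling mismatch mentioned in the last point is to strip diacritics before comparing words. This was not done in this report; the sketch below uses stri_trans_general() from the stringi package, and the two name vectors are made up for illustration.

```r
library(stringi)

# "\u00e1" is "á" and "\u00f3" is "ó", written as escapes to avoid encoding issues
names_novel <- c("nat\u00e1sha", "pierre", "rost\u00f3va")
names_opera <- c("natasha", "pierre")

# Transliterate accented Latin letters to plain ASCII
normalized <- stri_trans_general(names_novel, "Latin-ASCII")
normalized

# Which novel spellings now match a spelling used in the opera?
normalized %in% names_opera
```

This only handles accents. Variants such as Natalie vs Nataly would still need a manual lookup table.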
In summary, both works cover topics like war and the turmoil that it brings, so it is no surprise the overall sentiments are negative. However, looking at the most common words in the word clouds, one may be left with a different impression. Initially I expected more overlap in the sentiment analysis; that is, I figured more of the bi-grams and top words would be similar or in a similar order. But because the works are in such different mediums, it is not surprising that they differ.