Text won’t be tidy at all stages of an analysis, and it is important to be able to convert back and forth between tidy and non-tidy formats. (Silge and Robinson 2018)

Computer Assisted Text analytics means much more than counting words. In particular, the combination of pattern-based and complex statistical approaches may be applied to support established qualitative data analysis designs and open them to a quantitative perspective. (Wiedemann 2016)

INTRODUCTION

This vignette explains one possible approach to sentiment analysis of literary works using tidy text. Given the genre of a literary work, do the sentiments it conveys match that genre?

In other words, do tragedies use words associated with tragic emotions? Do comedies use words associated with comic emotions? If so, what are those words and sentiments?

To find out which sentiments are conveyed in each genre, I chose tragedies and comedies by William Shakespeare to see whether the words used in these plays are really associated with the genre in which they are classified.

CHOOSING THE PLAYS

The following works of Shakespeare were selected from the Project Gutenberg collection (https://www.gutenberg.org/).

Tragedies:

Antony and Cleopatra
Hamlet
Julius Caesar
Macbeth
Othello

Comedies:

A Midsummer Night’s Dream
Measure for Measure
The Comedy of Errors
The Tempest
As You Like It

PROCESS AND STEPS

Below are the steps to discover the sentiments conveyed in these plays. Let’s find out.

Step 1: Initialise the required packages.

library(dplyr)

library(stringr)

library(tidytext)

library(gutenbergr)

library(ggplot2)

library(tidyverse)

library(jtools)

library(grid)

library(gridExtra)

library(ggplotify)

library(wordcloud)

Customise the ggplot2 theme

my_theme <- function() {
  theme_apa(legend.pos = "none") +
    theme(panel.background = element_blank()) +
    theme(plot.background = element_rect(fill = "antiquewhite1")) +
    theme(panel.border = element_blank()) +                       # facet border
    theme(strip.background = element_blank()) +                  # facet title background
    theme(plot.margin = unit(c(.5, .5, .5, .5), "cm")) 
}

The gutenbergr package includes tools for downloading books and the complete dataset of Project Gutenberg metadata which can be used to find works of interest.

Step 2: Check the metadata fields of the Gutenberg works to see the available columns and how the metadata is structured.

gutenberg_metadata
## # A tibble: 51,997 x 8
##    gutenberg_id title author gutenberg_autho~ language gutenberg_books~
##           <int> <chr> <chr>             <int> <chr>    <chr>           
##  1            0 <NA>  <NA>                 NA en       <NA>            
##  2            1 The ~ Jeffe~             1638 en       United States L~
##  3            2 "The~ Unite~                1 en       American Revolu~
##  4            3 John~ Kenne~             1666 en       <NA>            
##  5            4 "Lin~ Linco~                3 en       US Civil War    
##  6            5 The ~ Unite~                1 en       American Revolu~
##  7            6 Give~ Henry~                4 en       American Revolu~
##  8            7 The ~ <NA>                 NA en       <NA>            
##  9            8 Abra~ Linco~                3 en       US Civil War    
## 10            9 Abra~ Linco~                3 en       US Civil War    
## # ... with 51,987 more rows, and 2 more variables: rights <chr>,
## #   has_text <lgl>

We see there are over 50,000 titles available from the Gutenberg library. How do we download the book of our choice?

Step 3: As an example, let’s look at a book of our choice - Julius Caesar.

gutenberg_metadata %>%
  filter(title == "Julius Caesar")
## # A tibble: 6 x 8
##   gutenberg_id title author gutenberg_autho~ language gutenberg_books~
##          <int> <chr> <chr>             <int> <chr>    <chr>           
## 1         1522 Juli~ Shake~               65 en       <NA>            
## 2         1785 Juli~ Shake~               65 en       <NA>            
## 3         2263 Juli~ Shake~               65 en       <NA>            
## 4         9875 Juli~ Shake~               65 de       DE Drama        
## 5        18512 Juli~ Shake~               65 fi       <NA>            
## 6        46768 Juli~ Shake~               65 la       <NA>            
## # ... with 2 more variables: rights <chr>, has_text <lgl>

Notice that the book is available in several versions and languages. To download a specific title, filter by title and note the gutenberg_id of the version you want. The gutenberg_id used here for Julius Caesar is 1522.
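As a minimal sketch of that filtering step, the snippet below narrows the six Julius Caesar rows to the English editions. The tibble caesar_editions is a hand-built stand-in (a hypothetical name) that copies only the gutenberg_id and language values printed above; with the gutenbergr package loaded, the same filter() could be applied to gutenberg_metadata itself.

```r
library(dplyr)
library(tibble)

# Stand-in for the six "Julius Caesar" rows printed above
# (only the columns needed for this illustration).
caesar_editions <- tibble(
  gutenberg_id = c(1522L, 1785L, 2263L, 9875L, 18512L, 46768L),
  title        = "Julius Caesar",
  language     = c("en", "en", "en", "de", "fi", "la")
)

# Keep the English editions and list their IDs.
english_ids <- caesar_editions %>%
  filter(language == "en") %>%
  pull(gutenberg_id)

english_ids
```

Three English editions remain, so the final choice of ID still has to be made by hand.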

Step 4: Download Julius Caesar:

Julius_Caesar <- gutenberg_download(1522)

Julius_Caesar
## # A tibble: 4,637 x 2
##    gutenberg_id text                  
##           <int> <chr>                 
##  1         1522 JULIUS CAESAR         
##  2         1522 ""                    
##  3         1522 by William Shakespeare
##  4         1522 ""                    
##  5         1522 ""                    
##  6         1522 ""                    
##  7         1522 ""                    
##  8         1522 PERSONS REPRESENTED   
##  9         1522 ""                    
## 10         1522 JULIUS CAESAR         
## # ... with 4,627 more rows

Step 5: Now that we know how to access the Gutenberg library and download books of our choice, let’s continue with our Sentiment Analysis and download the Comedies and Tragedies we need for our analysis.

Let’s download the comedies first. As pointed out earlier, each play has an ID associated with it, and those IDs are what we need for downloading.

comedies <- gutenberg_download(c(1504, 1540, 1530, 1523, 1514), meta_fields = "title")

Step 6: Check if the Comedies have downloaded correctly.

comedies %>%
  count(title)
## # A tibble: 5 x 2
##   title                         n
##   <chr>                     <int>
## 1 A Midsummer Night's Dream  3459
## 2 As You Like It             4530
## 3 Measure for Measure        4905
## 4 The Comedy of Errors       3194
## 5 The Tempest                3888

To work as a tidy dataset, data needs to be restructured to one-token-per-row format. This is done using the function unnest_tokens().

It breaks the text into individual tokens. A token is a meaningful unit of text, most often a word, that we are interested in using for further analysis, and tokenization is the process of breaking the text into tokens.
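To make the one-token-per-row idea concrete, here is a small sketch on two lines from Julius Caesar. The tibble speech and its line column are assumptions made for illustration; unnest_tokens() is the tidytext function used throughout this vignette.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Two lines of text, one row per line.
speech <- tibble(
  line = 1:2,
  text = c("Friends, Romans, countrymen,", "lend me your ears;")
)

# One row per word; by default the tokens are lower-cased
# and punctuation is dropped.
speech_tokens <- speech %>%
  unnest_tokens(word, text)

speech_tokens
```

Each of the seven words ends up in its own row, while the line column records where it came from.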

Step 7: Split the original text into Tokens using the function unnest_tokens()

tidy_comedies <- comedies %>%
  unnest_tokens(word, text)

tidy_comedies
## # A tibble: 97,218 x 3
##    gutenberg_id title                word       
##           <int> <chr>                <chr>      
##  1         1504 The Comedy of Errors the        
##  2         1504 The Comedy of Errors comedy     
##  3         1504 The Comedy of Errors of         
##  4         1504 The Comedy of Errors errors     
##  5         1504 The Comedy of Errors by         
##  6         1504 The Comedy of Errors william    
##  7         1504 The Comedy of Errors shakespeare
##  8         1504 The Comedy of Errors persons    
##  9         1504 The Comedy of Errors represented
## 10         1504 The Comedy of Errors solinus    
## # ... with 97,208 more rows

Text analysis usually requires stop words to be removed. Stop words are very common words that carry little meaning for the analysis, such as “the”, “of” and “to”.

Step 8: Remove the Stop Words with this simple line of code.

data(stop_words)

tidy_comedies <- tidy_comedies %>%
  anti_join(stop_words)

Step 9: Having cleaned our data from Stop Words, let’s use dplyr’s count() function to find the most common words in our list of selected comedies.

What are the most common words in the Comedies of Shakespeare?

tidy_comedies %>%
  count(word, sort = TRUE)
## # A tibble: 7,919 x 2
##    word         n
##    <chr>    <int>
##  1 thou       690
##  2 duke       429
##  3 sir        391
##  4 thee       364
##  5 thy        350
##  6 love       285
##  7 rosalind   279
##  8 enter      248
##  9 syracuse   227
## 10 dromio     222
## # ... with 7,909 more rows

The word thou takes the top spot followed by duke, sir and thee.
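Archaic pronouns such as thou, thee and thy survive the stop-word filter because they are not in the stop_words dataset, as the counts above show. One possible way to drop them, sketched here under the assumption that they add nothing to the analysis, is to append a custom list before the anti-join. The names archaic_stop_words, all_stop_words and toy_tokens are hypothetical.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical extra stop words for Early Modern English.
archaic_stop_words <- tibble(
  word    = c("thou", "thee", "thy"),
  lexicon = "custom"
)

# Combine the built-in list with the custom one.
all_stop_words <- bind_rows(stop_words, archaic_stop_words)

# Toy tokens: two archaic pronouns, one modern stop word, two content words.
toy_tokens <- tibble(word = c("thou", "the", "duke", "thy", "love"))

filtered <- toy_tokens %>%
  anti_join(all_stop_words, by = "word")

filtered
```

Only duke and love remain. Whether removing the pronouns is appropriate depends on the question being asked of the text.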

Step 10: Plot a graph to see the common words in these Comedies.

tidy_comedies %>%
  count(word, sort = TRUE) %>%
  filter(n > 300) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

The tidytext package bundles several sentiment lexicons in its sentiments dataset. Each row assigns a word to a sentiment within a particular lexicon. Depending on the lexicon, this is a positive or negative label, an emotion such as joy, sadness, disgust, fear, surprise or trust, or a numeric score.

sentiments
## # A tibble: 27,314 x 4
##    word        sentiment lexicon score
##    <chr>       <chr>     <chr>   <int>
##  1 abacus      trust     nrc        NA
##  2 abandon     fear      nrc        NA
##  3 abandon     negative  nrc        NA
##  4 abandon     sadness   nrc        NA
##  5 abandoned   anger     nrc        NA
##  6 abandoned   fear      nrc        NA
##  7 abandoned   negative  nrc        NA
##  8 abandoned   sadness   nrc        NA
##  9 abandonment anger     nrc        NA
## 10 abandonment fear      nrc        NA
## # ... with 27,304 more rows

There are three general purpose lexicons:

AFINN from Finn Arup Nielsen,
bing from Bing Liu and collaborators,
nrc from Saif Mohammad and Peter Turney.

Tidytext provides the function get_sentiments() to retrieve a specific sentiment lexicon without the columns that are not used in that lexicon.

The AFINN lexicon assigns each word a score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.

get_sentiments("afinn")
## # A tibble: 2,476 x 2
##    word       score
##    <chr>      <int>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ... with 2,466 more rows
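A numeric lexicon of this kind is typically used by joining it to the tokens and summing the scores. The sketch below assumes a three-word toy lexicon with scores copied from the listing above and two invented plays; toy_afinn, toy_tokens and play_scores are hypothetical names. The column name score follows the listing above; some tidytext releases name it value instead, so check the output of get_sentiments("afinn") on your installation.

```r
library(dplyr)
library(tibble)

# Toy lexicon using three scores from the listing above.
toy_afinn <- tibble(
  word  = c("abandon", "abandoned", "abhor"),
  score = c(-2L, -2L, -3L)
)

# Invented tokens; "duke" has no score and drops out of the join.
toy_tokens <- tibble(
  title = c("Play A", "Play A", "Play B", "Play B"),
  word  = c("abandon", "duke", "abhor", "abandoned")
)

# Sum the scores of all matched words per play.
play_scores <- toy_tokens %>%
  inner_join(toy_afinn, by = "word") %>%
  group_by(title) %>%
  summarise(total = sum(score))

play_scores
```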

bing lexicon categorizes words in a binary fashion into positive and negative categories.

get_sentiments("bing")
## # A tibble: 6,788 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faced     negative 
##  2 2-faces     negative 
##  3 a+          positive 
##  4 abnormal    negative 
##  5 abolish     negative 
##  6 abominable  negative 
##  7 abominably  negative 
##  8 abominate   negative 
##  9 abomination negative 
## 10 abort       negative 
## # ... with 6,778 more rows

nrc lexicon categorizes words in a binary fashion (yes or no) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.

get_sentiments("nrc")
## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ... with 13,891 more rows

Now that we have a way to associate emotions with words, we can figure out the sentiment associated with each of the selected comedies.
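The association itself is an inner join between the tokens and a lexicon. A minimal sketch, assuming a hand-built stand-in for the nrc lexicon (toy_nrc, with rows copied from the listing above) and a few invented tokens; note that one word can carry several sentiments, so it contributes to several categories.

```r
library(dplyr)
library(tibble)

# Four nrc rows copied from the listing above.
toy_nrc <- tibble(
  word      = c("abacus", "abandon", "abandon", "abandon"),
  sentiment = c("trust", "fear", "negative", "sadness")
)

# Invented tokens; "duke" is not in the lexicon.
toy_tokens <- tibble(word = c("abandon", "abandon", "abacus", "duke"))

# Count each word first, then attach its sentiments and
# add up the word counts per sentiment.
sentiment_counts <- toy_tokens %>%
  count(word) %>%
  inner_join(toy_nrc, by = "word") %>%
  count(sentiment, wt = n, sort = TRUE)

sentiment_counts
```

Words missing from the lexicon, such as duke here, simply drop out, which is worth keeping in mind when a lexicon built for modern English is applied to Shakespeare.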

Step 11: Let’s look at the overall sentiment of the 5 Comedies we have chosen.

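# Rebuild the tidy data with a line index. Note that gutenberg_id is
# overwritten with the line number within each play, and that stop words
# are not removed in this version. The "chapter" pattern is not expected
# to match anything in the plays; the column is not used below.
tidy_comedies <- comedies %>%
  group_by(title) %>%
  mutate(gutenberg_id = row_number(),
         chapter = cumsum(str_detect(text, 
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)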

sentiments_check <- get_sentiments("nrc")

sentiments_check
## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ... with 13,891 more rows

nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")

nrc_joy
## # A tibble: 689 x 2
##    word          sentiment
##    <chr>         <chr>    
##  1 absolution    joy      
##  2 abundance     joy      
##  3 abundant      joy      
##  4 accolade      joy      
##  5 accompaniment joy      
##  6 accomplish    joy      
##  7 accomplished  joy      
##  8 achieve       joy      
##  9 achievement   joy      
## 10 acrobat       joy      
## # ... with 679 more rows

tidy_comedies %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## # A tibble: 287 x 2
##    word      n
##    <chr> <int>
##  1 good    330
##  2 love    285
##  3 art     126
##  4 sweet    97
##  5 true     97
##  6 pray     96
##  7 clown    76
##  8 marry    55
##  9 young    46
## 10 youth    46
## # ... with 277 more rows

library(tidyr)

# Net sentiment per bin of lines: the number of positive words minus
# the number of negative words.

comedies_sentiment <- tidy_comedies %>%
  inner_join(get_sentiments("bing")) %>%
  # %/% (integer division) groups every 80 consecutive lines into one bin
  count(title, index = gutenberg_id %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
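The index column is what places the sentiment along the course of a play. A small sketch of line-based binning, assuming that consecutive blocks of 80 lines should share one index: integer division (%/%) sends lines 0 to 79 to bin 0, lines 80 to 159 to bin 1 and so on, whereas the modulo operator (%%) returns the remainder and would put lines 1, 81 and 161 into the same bin. The toy data and the names toy_hits and binned are hypothetical.

```r
library(dplyr)
library(tibble)
library(tidyr)

# Integer division versus modulo on three line numbers.
c(1, 81, 161) %/% 80   # consecutive bins: 0 1 2
c(1, 81, 161) %% 80    # remainders: 1 1 1

# Invented sentiment hits, one row per matched word.
toy_hits <- tibble(
  linenumber = c(1, 2, 3, 81, 82, 161),
  sentiment  = c("positive", "negative", "positive",
                 "negative", "negative", "positive")
)

# Count hits per bin and sentiment, spread to one column per
# sentiment, then compute the net sentiment of each bin.
binned <- toy_hits %>%
  count(index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(net = positive - negative)

binned
```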


library(ggplot2)

ggplot(comedies_sentiment, aes(index, sentiment, fill = title)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~title, ncol = 2, scales = "free_x") +
  my_theme()

The bing lexicon labels each word as positive or negative. In the graph, The Comedy of Errors and The Tempest appear to contain many words associated with negative emotions. Negative emotions in a comedy? Let’s see which words contribute to the sentiment in The Comedy of Errors.

Step 12: Check the words contributing to a negative or positive sentiment in specific plays.

tidy_comedies %>%
  filter(title == "The Comedy of Errors") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## # A tibble: 112 x 2
##    word      n
##    <chr> <int>
##  1 good     28
##  2 money    26
##  3 art      22
##  4 love     19
##  5 pray     17
##  6 sweet    13
##  7 merry    11
##  8 god      10
##  9 marry    10
## 10 jest      9
## # ... with 102 more rows

bing_word_counts <- tidy_comedies %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

bing_word_counts
## # A tibble: 1,409 x 3
##    word   sentiment     n
##    <chr>  <chr>     <int>
##  1 good   positive    330
##  2 love   positive    285
##  3 well   positive    239
##  4 like   positive    197
##  5 master positive    123
##  6 sweet  positive     97
##  7 fair   positive     88
##  8 poor   negative     80
##  9 death  negative     78
## 10 die    negative     76
## # ... with 1,399 more rows
bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()
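In current dplyr releases top_n() is superseded by slice_max(), which makes the ranking column explicit. A minimal sketch, using six counts copied from the bing_word_counts listing above and keeping the top two words per sentiment; the names toy_counts and top_words are hypothetical.

```r
library(dplyr)
library(tibble)

# Six rows copied from the bing_word_counts listing above.
toy_counts <- tibble(
  word      = c("good", "love", "well", "poor", "death", "die"),
  sentiment = c("positive", "positive", "positive",
                "negative", "negative", "negative"),
  n         = c(330L, 285L, 239L, 80L, 78L, 76L)
)

# Keep the two most frequent words within each sentiment.
top_words <- toy_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 2) %>%
  ungroup()

top_words
```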

Let’s look at TRAGEDIES

tragedies <- gutenberg_download(c(1522, 1534, 1787, 1533, 1793), meta_fields = "title")

Step 1: Check if the Tragedies have downloaded correctly

tragedies %>%
  count(title)
## # A tibble: 5 x 2
##   title                    n
##   <chr>                <int>
## 1 Antony and Cleopatra  6638
## 2 Hamlet                5146
## 3 Julius Caesar         4637
## 4 Macbeth               4152
## 5 Othello               4456

To work as a tidy dataset, the data needs to be restructured to one-token-per-row format. This is done using the function unnest_tokens().

It breaks the text into individual tokens. A token is a meaningful unit of text, most often a word, that we are interested in using for further analysis, and tokenization is the process of breaking the text into tokens.

Step 2: Split the original text into Tokens using the function unnest_tokens()

tidy_tragedies <- tragedies %>%
  unnest_tokens(word, text)

tidy_tragedies
## # A tibble: 129,996 x 3
##    gutenberg_id title         word       
##           <int> <chr>         <chr>      
##  1         1522 Julius Caesar julius     
##  2         1522 Julius Caesar caesar     
##  3         1522 Julius Caesar by         
##  4         1522 Julius Caesar william    
##  5         1522 Julius Caesar shakespeare
##  6         1522 Julius Caesar persons    
##  7         1522 Julius Caesar represented
##  8         1522 Julius Caesar julius     
##  9         1522 Julius Caesar caesar     
## 10         1522 Julius Caesar octavius   
## # ... with 129,986 more rows

Text analysis usually requires stop words to be removed. Stop words are very common words that carry little meaning for the analysis, such as “the”, “of” and “to”.

Step 3: Remove the Stop Words with this simple line of code.

data(stop_words)

tidy_tragedies <- tidy_tragedies %>%
  anti_join(stop_words)

Step 4: Having cleared our data from Stop Words, let’s use dplyr’s count() function to find the most common words in our list of selected tragedies.

What are the most common words in the selected Tragedies of Shakespeare?

tidy_tragedies %>%
  count(word, sort = TRUE)
## # A tibble: 9,669 x 2
##    word       n
##    <chr>  <int>
##  1 thou     632
##  2 antony   510
##  3 caesar   508
##  4 lord     452
##  5 thy      382
##  6 enter    377
##  7 brutus   373
##  8 iago     362
##  9 thee     359
## 10 ham      358
## # ... with 9,659 more rows

The word thou again takes the top spot, followed by antony, caesar and lord.

Step 5: Plot a graph to see the most common words in the Tragedies.

tidy_tragedies %>%
  count(word, sort = TRUE) %>%
  filter(n > 300) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

Look at specific sentiments as we did for comedies.

sentiments
## # A tibble: 27,314 x 4
##    word        sentiment lexicon score
##    <chr>       <chr>     <chr>   <int>
##  1 abacus      trust     nrc        NA
##  2 abandon     fear      nrc        NA
##  3 abandon     negative  nrc        NA
##  4 abandon     sadness   nrc        NA
##  5 abandoned   anger     nrc        NA
##  6 abandoned   fear      nrc        NA
##  7 abandoned   negative  nrc        NA
##  8 abandoned   sadness   nrc        NA
##  9 abandonment anger     nrc        NA
## 10 abandonment fear      nrc        NA
## # ... with 27,304 more rows

Assess scores of negative and positive sentiments using get_sentiments

get_sentiments("afinn")
## # A tibble: 2,476 x 2
##    word       score
##    <chr>      <int>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ... with 2,466 more rows

bing lexicon categorizes words in a binary fashion into positive and negative categories.

get_sentiments("bing")
## # A tibble: 6,788 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faced     negative 
##  2 2-faces     negative 
##  3 a+          positive 
##  4 abnormal    negative 
##  5 abolish     negative 
##  6 abominable  negative 
##  7 abominably  negative 
##  8 abominate   negative 
##  9 abomination negative 
## 10 abort       negative 
## # ... with 6,778 more rows

nrc lexicon for categorization of words into positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.

get_sentiments("nrc")
## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ... with 13,891 more rows

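Note that these listings show only the first alphabetical entries of each lexicon, which is why abandon and abandoned appear so often; they say nothing yet about the plays themselves. To see whether emotions typical of tragedy, such as anger, sadness and fear, stand out, the lexicon has to be joined to the tokenised text.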

Step 6: Look at the overall sentiment of the 5 Tragedies we have chosen.

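# Rebuild the tidy data with a line index. Note that gutenberg_id is
# overwritten with the line number within each play, and that stop words
# are not removed in this version. The "chapter" pattern is not expected
# to match anything in the plays; the column is not used below.
tidy_tragedies <- tragedies %>%
  group_by(title) %>%
  mutate(gutenberg_id = row_number(),
         chapter = cumsum(str_detect(text, 
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)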

sentiments_check <- get_sentiments("nrc")

sentiments_check
## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
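## # ... with 13,891 more rows

# Note: "sorrow" is not one of the nrc categories listed above, so this
# filter returns zero rows; "sadness" is the closest nrc category.
nrc_sorrow <- get_sentiments("nrc") %>%
  filter(sentiment == "sorrow")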

nrc_sorrow
## # A tibble: 0 x 2
## # ... with 2 variables: word <chr>, sentiment <chr>

# nrc_sorrow is empty (see the output above), so this join matches nothing.
tidy_tragedies %>%
  inner_join(nrc_sorrow) %>%
  count(word, sort = TRUE)
## # A tibble: 0 x 2
## # ... with 2 variables: word <chr>, n <int>
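Filtering on a label that does not exist in the lexicon fails silently: the result is simply an empty tibble. A small sketch of how one might check the available categories first, using a hand-built stand-in for the nrc lexicon (toy_nrc, a hypothetical name, with rows copied from the listing above):

```r
library(dplyr)
library(tibble)

# Five nrc rows copied from the listing above.
toy_nrc <- tibble(
  word      = c("abacus", "abandon", "abandon", "abandon", "abandoned"),
  sentiment = c("trust", "fear", "negative", "sadness", "anger")
)

# List the categories that actually occur in the lexicon.
available <- toy_nrc %>%
  distinct(sentiment) %>%
  pull(sentiment)

"sorrow" %in% available    # FALSE: filtering on it yields no rows
"sadness" %in% available   # TRUE

# Filter on a category that exists.
nrc_sadness <- toy_nrc %>%
  filter(sentiment == "sadness")

nrc_sadness
```

Against the real lexicon, the same distinct() call on get_sentiments("nrc") would list the categories before any filter is written.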
library(tidyr)

# Net sentiment per bin of lines: the number of positive words minus
# the number of negative words. Othello appears to have the largest
# number of negative words.

tragedies_sentiment <- tidy_tragedies %>%
  inner_join(get_sentiments("bing")) %>%
  # %/% (integer division) groups every 80 consecutive lines into one bin
  count(title, index = gutenberg_id %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

library(ggplot2)

ggplot(tragedies_sentiment, aes(index, sentiment, fill = title)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~title, ncol = 2, scales = "free_x")

In the graph, Antony and Cleopatra and Othello appear to contain several words associated with negative emotions. Which words contribute to this sentiment in Antony and Cleopatra?

Step 7: Check the words contributing to a negative or positive sentiment in specific plays.

tidy_tragedies %>%
  filter(title == "Antony and Cleopatra") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## # A tibble: 159 x 2
##    word        n
##    <chr>   <int>
##  1 good       95
##  2 love       39
##  3 friend     26
##  4 fortune    25
##  5 art        19
##  6 pray       19
##  7 peace      16
##  8 true       14
##  9 clown      10
## 10 honest     10
## # ... with 149 more rows

bing_word_counts <- tidy_tragedies %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

bing_word_counts
## # A tibble: 1,535 x 3
##    word   sentiment     n
##    <chr>  <chr>     <int>
##  1 good   positive    419
##  2 well   positive    329
##  3 like   positive    252
##  4 love   positive    238
##  5 great  positive    156
##  6 death  negative    134
##  7 heaven positive    133
##  8 noble  positive    121
##  9 fear   negative    117
## 10 dead   negative    100
## # ... with 1,525 more rows
bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()

We notice that negative words such as death, fear and poor occur less often than positive words such as good, well, like and love.
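One way to put the two genres side by side is to compare the share of negative words in each. The sketch below assumes only three rows per genre, copied from the two bing_word_counts listings above, so the resulting shares illustrate the calculation and are not a finding about the plays. All object names are hypothetical.

```r
library(dplyr)
library(tibble)

# Three rows per genre, copied from the listings above.
comedy_counts <- tibble(
  word      = c("good", "poor", "death"),
  sentiment = c("positive", "negative", "negative"),
  n         = c(330, 80, 78)
)
tragedy_counts <- tibble(
  word      = c("good", "death", "fear"),
  sentiment = c("positive", "negative", "negative"),
  n         = c(419, 134, 117)
)

# Stack the genres, total the counts per sentiment,
# and convert the totals to shares within each genre.
genre_shares <- bind_rows(comedy = comedy_counts,
                          tragedy = tragedy_counts,
                          .id = "genre") %>%
  count(genre, sentiment, wt = n, name = "total") %>%
  group_by(genre) %>%
  mutate(share = total / sum(total)) %>%
  ungroup()

genre_shares
```

Applied to the full bing_word_counts table of each genre, the same pipeline would give the actual shares.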

CONCLUSION

The aim of this vignette was simply to illustrate the ease with which one can explore texts with the tidytext package in combination with other tidy tools.

The words quantified and analysed are just from 5 plays each based on their genre. The results obtained certainly reveal an interesting aspect of the bard’s plays.

References

Silge, Julia, and David Robinson. 2018. Text Mining with R. https://www.tidytextmining.com/index.html.

Wiedemann, Gregor. 2016. Text Mining for Qualitative Data Analysis in the Social Sciences: A Study on Democratic Discourse in Germany. Wiesbaden, GERMANY: Vieweg. http://ebookcentral.proquest.com/lib/uts/detail.action?docID=4653480.