Text won’t be tidy at all stages of an analysis, and it is important to be able to convert back and forth between tidy and non-tidy formats. (Silge and Robinson 2018)
Computer Assisted Text analytics means much more than counting words. In particular, the combination of pattern-based and complex statistical approaches may be applied to support established qualitative data analysis designs and open them to a quantitative perspective. (Wiedemann 2016)
This vignette explains one possible approach to sentiment analysis of literary works using tidytext. Does the genre of a work tell us which sentiments it conveys?
That is, do tragedies use words associated with tragic emotions, and do comedies use words associated with comic ones? If so, which words and sentiments are they?
To find out whether the sentiments conveyed match the genre of a literary work, I chose tragedies and comedies by William Shakespeare.
The following works of Shakespeare were selected from the Project Gutenberg collection (https://www.gutenberg.org/).
Tragedies:
Antony and Cleopatra
Hamlet
Julius Caesar
Macbeth
Othello
Comedies:
A Midsummer Night’s Dream
Measure for Measure
The Comedy of Errors
The Tempest
As You Like It
Below are the steps to discover the sentiments conveyed in these plays. Let’s find out.
Step 1: Initialise the required packages.
library(dplyr)
library(stringr)
library(tidytext)
library(gutenbergr)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.3
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.5.3
library(jtools)
library(grid)
library(gridExtra)
library(ggplotify)
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 3.5.3
Customise the ggplot2 theme
my_theme <- function() {
  theme_apa(legend.pos = "none") +
    theme(panel.background = element_blank()) +
    theme(plot.background = element_rect(fill = "antiquewhite1")) +
    theme(panel.border = element_blank()) + # facet border
    theme(strip.background = element_blank()) + # facet title background
    theme(plot.margin = unit(c(.5, .5, .5, .5), "cm"))
}
The gutenbergr package includes tools for downloading books and the complete dataset of Project Gutenberg metadata which can be used to find works of interest.
Step 2: Check the metadata fields of the Gutenberg works to see the available columns and how the metadata is structured.
gutenberg_metadata
## # A tibble: 51,997 x 8
## gutenberg_id title author gutenberg_autho~ language gutenberg_books~
## <int> <chr> <chr> <int> <chr> <chr>
## 1 0 <NA> <NA> NA en <NA>
## 2 1 The ~ Jeffe~ 1638 en United States L~
## 3 2 "The~ Unite~ 1 en American Revolu~
## 4 3 John~ Kenne~ 1666 en <NA>
## 5 4 "Lin~ Linco~ 3 en US Civil War
## 6 5 The ~ Unite~ 1 en American Revolu~
## 7 6 Give~ Henry~ 4 en American Revolu~
## 8 7 The ~ <NA> NA en <NA>
## 9 8 Abra~ Linco~ 3 en US Civil War
## 10 9 Abra~ Linco~ 3 en US Civil War
## # ... with 51,987 more rows, and 2 more variables: rights <chr>,
## # has_text <lgl>
We see there are over 50,000 titles available from the Gutenberg library. How do we download the book of our choice?
Step 3: As an example, let’s look at a play of our choice - Julius Caesar.
gutenberg_metadata %>%
filter(title == "Julius Caesar")
## # A tibble: 6 x 8
## gutenberg_id title author gutenberg_autho~ language gutenberg_books~
## <int> <chr> <chr> <int> <chr> <chr>
## 1 1522 Juli~ Shake~ 65 en <NA>
## 2 1785 Juli~ Shake~ 65 en <NA>
## 3 2263 Juli~ Shake~ 65 en <NA>
## 4 9875 Juli~ Shake~ 65 de DE Drama
## 5 18512 Juli~ Shake~ 65 fi <NA>
## 6 46768 Juli~ Shake~ 65 la <NA>
## # ... with 2 more variables: rights <chr>, has_text <lgl>
Notice that the play is available in several versions and languages. To download a specific title, filter by title and note the gutenberg_id of the version you want. Here we use Gutenberg ID 1522 for Julius Caesar. Let’s download it.
Julius_Caesar <- gutenberg_download(1522)
Julius_Caesar
## # A tibble: 4,637 x 2
## gutenberg_id text
## <int> <chr>
## 1 1522 JULIUS CAESAR
## 2 1522 ""
## 3 1522 by William Shakespeare
## 4 1522 ""
## 5 1522 ""
## 6 1522 ""
## 7 1522 ""
## 8 1522 PERSONS REPRESENTED
## 9 1522 ""
## 10 1522 JULIUS CAESAR
## # ... with 4,627 more rows
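The choice between versions described above can also be narrowed in code. The following is a minimal sketch, assuming the bundled gutenberg_metadata table has the language and has_text columns shown in the output earlier; the name caesar_en is only illustrative.

```r
library(dplyr)
library(gutenbergr)

# Keep only English editions of the play that have downloadable text.
# caesar_en is a hypothetical name used for this illustration.
caesar_en <- gutenberg_metadata %>%
  filter(title == "Julius Caesar",
         language == "en",
         has_text)

# Inspect the remaining candidates before picking a gutenberg_id
caesar_en %>%
  select(gutenberg_id, title, author, language)
```

Any of the remaining IDs could be passed to gutenberg_download(); which edition is preferable depends on the formatting of each text.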
Step 4: Now that we know how to access the Gutenberg library and download books of our choice, let’s continue with our Sentiment Analysis and download the Comedies and Tragedies we need for our analysis.
plays <- gutenberg_download(c(1504, 1540, 1530, 1523, 1514, 1522, 1534, 1787, 1533, 1793), meta_fields = "title")
Step 5: Check if the plays have downloaded correctly.
plays %>%
count(title)
## # A tibble: 10 x 2
## title n
## <chr> <int>
## 1 A Midsummer Night's Dream 3459
## 2 Antony and Cleopatra 6638
## 3 As You Like It 4530
## 4 Hamlet 5146
## 5 Julius Caesar 4637
## 6 Macbeth 4152
## 7 Measure for Measure 4905
## 8 Othello 4456
## 9 The Comedy of Errors 3194
## 10 The Tempest 3888
To work as a tidy dataset, data needs to be restructured to one-token-per-row format. This is done using the function unnest_tokens().
It breaks the text into individual tokens. A token is a meaningful unit of text, most often a word, that we are interested in using for further analysis, and tokenization is the process of breaking the text into tokens.
Step 6: Split the original text into Tokens using the function unnest_tokens()
tidy_plays <- plays %>%
unnest_tokens(word, text)
tidy_plays
## # A tibble: 227,214 x 3
## gutenberg_id title word
## <int> <chr> <chr>
## 1 1504 The Comedy of Errors the
## 2 1504 The Comedy of Errors comedy
## 3 1504 The Comedy of Errors of
## 4 1504 The Comedy of Errors errors
## 5 1504 The Comedy of Errors by
## 6 1504 The Comedy of Errors william
## 7 1504 The Comedy of Errors shakespeare
## 8 1504 The Comedy of Errors persons
## 9 1504 The Comedy of Errors represented
## 10 1504 The Comedy of Errors solinus
## # ... with 227,204 more rows
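To see what tokenization does on a small scale, here is a sketch on two invented lines of input (the toy tibble is not part of the downloaded data). By default unnest_tokens() splits on words, lowercases them and drops punctuation.

```r
library(dplyr)
library(tidytext)

# Two lines of toy input; the column names are chosen for illustration
toy <- tibble(
  line = 1:2,
  text = c("Friends, Romans, countrymen,", "lend me your ears;")
)

# One row per word; the original text column is replaced by `word`
toy_tokens <- toy %>%
  unnest_tokens(word, text)

toy_tokens
```

The line column is carried along, so every token can still be traced back to the line it came from.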
Text analysis also requires stop words to be removed. Stop words are very common words that carry little meaning for the analysis, such as “the”, “of” and “to”.
Step 7: Remove the stop words with this simple line of code.
data(stop_words)
tidy_plays <- tidy_plays %>%
anti_join(stop_words)
Having cleaned our data from Stop Words, what are the most common words in our selected plays of Shakespeare?
Step 8: Let’s use dplyr’s count() function to find the most common words in our list of selected plays.
tidy_plays %>%
count(word, sort = TRUE)
## # A tibble: 13,037 x 2
## word n
## <chr> <int>
## 1 thou 1322
## 2 thy 732
## 3 thee 723
## 4 sir 653
## 5 lord 631
## 6 enter 625
## 7 love 523
## 8 antony 510
## 9 caesar 510
## 10 hath 481
## # ... with 13,027 more rows
The word thou takes the top spot followed by thy and thee.
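Words such as thou, thy and thee behave much like stop words in Early Modern English, but the stop_words table is built for modern English and does not remove them. A minimal sketch of extending the list, assuming one wants to drop them; archaic_stop_words is a hypothetical list made up for this example, not something shipped with tidytext.

```r
library(dplyr)
library(tidytext)

# Hypothetical extra stop words for Shakespeare's English (an assumption)
archaic_stop_words <- tibble(
  word = c("thou", "thy", "thee", "hath"),
  lexicon = "custom"
)

# Combine with the stop words that ship with tidytext
all_stop_words <- bind_rows(stop_words, archaic_stop_words)

# Toy tokens standing in for tidy_plays
toy_words <- tibble(word = c("thou", "art", "the", "noblest", "roman"))

toy_clean <- toy_words %>%
  anti_join(all_stop_words, by = "word")

toy_clean
```

Passing by = "word" explicitly also silences the message that anti_join() prints when it has to guess the join column.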
Step 9: Plot a graph to see how the top words look visually.
tidy_plays %>%
  count(word, sort = TRUE) %>%
  filter(n > 300) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  my_theme()
The tidytext package provides access to several sentiment lexicons through the sentiments dataset. A lexicon here is a reference list of words, each annotated with information such as a sentiment label or score.
Each lexicon assigns words to sentiments: a positive or negative label, a numeric score, or an emotion such as joy, sadness, disgust, fear, surprise or trust.
sentiments
## # A tibble: 27,314 x 4
## word sentiment lexicon score
## <chr> <chr> <chr> <int>
## 1 abacus trust nrc NA
## 2 abandon fear nrc NA
## 3 abandon negative nrc NA
## 4 abandon sadness nrc NA
## 5 abandoned anger nrc NA
## 6 abandoned fear nrc NA
## 7 abandoned negative nrc NA
## 8 abandoned sadness nrc NA
## 9 abandonment anger nrc NA
## 10 abandonment fear nrc NA
## # ... with 27,304 more rows
There are three general purpose lexicons:
AFINN from Finn Årup Nielsen,
bing from Bing Liu and collaborators,
nrc from Saif Mohammad and Peter Turney.
tidytext provides the function get_sentiments() to retrieve a specific lexicon without the columns that the lexicon does not use.
The AFINN lexicon assigns each word a score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.
get_sentiments("afinn")
## # A tibble: 2,476 x 2
## word score
## <chr> <int>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ... with 2,466 more rows
The bing lexicon categorizes words in a binary fashion into positive and negative categories.
get_sentiments("bing")
## # A tibble: 6,788 x 2
## word sentiment
## <chr> <chr>
## 1 2-faced negative
## 2 2-faces negative
## 3 a+ positive
## 4 abnormal negative
## 5 abolish negative
## 6 abominable negative
## 7 abominably negative
## 8 abominate negative
## 9 abomination negative
## 10 abort negative
## # ... with 6,778 more rows
The nrc lexicon categorizes words in a binary fashion (yes or no) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.
get_sentiments("nrc")
## # A tibble: 13,901 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ... with 13,891 more rows
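Attaching a lexicon to tokens is a plain join on the word column. The sketch below uses the bing lexicon and a handful of invented tokens (toy_words is not part of the analysis data). One caveat, stated as an assumption about the installed version: in more recent tidytext releases the AFINN and nrc lexicons are fetched through the textdata package on first use, and the AFINN score column is named value rather than score, so the output shown above may look different on a current installation.

```r
library(dplyr)
library(tidytext)

# Toy tokens; "caesar" is included as a word that should find no match
toy_words <- tibble(word = c("love", "death", "sweet", "fear", "caesar"))

# inner_join() keeps only the words that appear in the lexicon
toy_sentiment <- toy_words %>%
  inner_join(get_sentiments("bing"), by = "word")

toy_sentiment
```

Words missing from the lexicon simply drop out of the result, which is why proper names and archaic vocabulary contribute nothing to the sentiment counts.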
Now that we have a way to associate emotions with words, we can work out the sentiment associated with each of the selected plays.
Step 10: Let’s look at the overall sentiment of the plays we have chosen.
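# Rebuild tidy_plays from the raw text. Note that stop words are not removed here.
tidy_plays <- plays %>%
  group_by(title) %>%
  # gutenberg_id is overwritten with the line number within each play
  mutate(gutenberg_id = row_number(),
         # This regex matches chapter headings, which plays (organised in acts
         # and scenes) are unlikely to contain; the column is not used below.
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)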
sentiments_check <- get_sentiments("nrc")
sentiments_check
## # A tibble: 13,901 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ... with 13,891 more rows
nrc_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
nrc_joy
## # A tibble: 689 x 2
## word sentiment
## <chr> <chr>
## 1 absolution joy
## 2 abundance joy
## 3 abundant joy
## 4 accolade joy
## 5 accompaniment joy
## 6 accomplish joy
## 7 accomplished joy
## 8 achieve joy
## 9 achievement joy
## 10 acrobat joy
## # ... with 679 more rows
tidy_plays %>%
#filter(title == "The Comedy of Errors") %>%
inner_join(nrc_joy) %>%
count(word, sort = TRUE)
## # A tibble: 357 x 2
## word n
## <chr> <int>
## 1 good 749
## 2 love 523
## 3 art 222
## 4 pray 189
## 5 true 188
## 6 sweet 161
## 7 clown 140
## 8 friend 120
## 9 god 110
## 10 young 102
## # ... with 347 more rows
library(tidyr)
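# Net sentiment per section: number of positive words minus number of negative words.
# Othello appears to have the largest number of negative words.
# gutenberg_id holds the line number within each play (set in the previous step),
# so integer division (%/%) groups the text into consecutive 80-line sections.
plays_sentiment <- tidy_plays %>%
  inner_join(get_sentiments("bing")) %>%
  count(title, index = gutenberg_id %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)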
library(ggplot2)
ggplot(plays_sentiment, aes(index, sentiment, fill = title)) +
geom_col(show.legend = FALSE) +
facet_wrap(~title, ncol = 2, scales = "free_x")
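The net sentiment calculation can be checked by hand on a tiny example. This sketch uses six invented, already-labelled words (toy, linenumber and net are names made up for the illustration) and groups lines into sections of 80 with integer division. It uses pivot_wider() from tidyr, which does the same reshaping job as spread().

```r
library(dplyr)
library(tidyr)

# Six labelled words with the line each one came from (invented data)
toy <- tibble(
  linenumber = c(1, 2, 3, 81, 82, 160),
  sentiment  = c("positive", "positive", "negative",
                 "negative", "negative", "positive")
)

toy_net <- toy %>%
  # lines 0-79 fall in section 0, 80-159 in section 1, 160-239 in section 2
  count(index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net = positive - negative)

toy_net
```

Section 0 has two positive words and one negative, section 1 has two negative words, and section 2 has one positive word, so the net values are 1, -2 and 1.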
The lexicons label individual words as positive or negative, so the plot reflects word choice only.
Interestingly, tragedies mostly contain more negative words than comedies. However, The Comedy of Errors and The Tempest, although comedies, seem to contain many words associated with negative emotions, while Antony and Cleopatra and Othello, although tragedies, contain a higher number of words associated with positive emotions.
Negative emotions in a comedy, or positive ones in a tragedy? Let’s see which words contribute to these sentiments.
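One way to put the genre comparison on a numeric footing is to attach a genre label to each title and compare the share of positive and negative words per genre. This is only a sketch: the genres lookup follows the list of plays at the top of this vignette, while the counts in toy_counts are invented numbers used to show the mechanics, not results from the plays.

```r
library(dplyr)

# Genre lookup, following the selection listed at the start of the vignette
genres <- tibble(
  title = c("Antony and Cleopatra", "Hamlet", "Julius Caesar",
            "Macbeth", "Othello",
            "A Midsummer Night's Dream", "Measure for Measure",
            "The Comedy of Errors", "The Tempest", "As You Like It"),
  genre = rep(c("Tragedy", "Comedy"), each = 5)
)

# INVENTED counts for illustration only; real counts would come from
# tidy_plays joined with a sentiment lexicon
toy_counts <- tibble(
  title     = c("Hamlet", "Hamlet", "The Tempest", "The Tempest"),
  sentiment = c("positive", "negative", "positive", "negative"),
  n         = c(40, 60, 70, 30)
)

genre_share <- toy_counts %>%
  inner_join(genres, by = "title") %>%
  group_by(genre, sentiment) %>%
  summarise(n = sum(n), .groups = "drop") %>%
  group_by(genre) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

genre_share
```

Comparing shares rather than raw counts keeps longer plays from dominating the comparison.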
Step 11: Check the contributing words to a negative or positive sentiment.
bing_word_counts <- tidy_plays %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
bing_word_counts
## # A tibble: 1,965 x 3
## word sentiment n
## <chr> <chr> <int>
## 1 good positive 749
## 2 well positive 568
## 3 love positive 523
## 4 like positive 449
## 5 great positive 217
## 6 death negative 212
## 7 heaven positive 196
## 8 fear negative 170
## 9 sweet positive 161
## 10 master positive 158
## # ... with 1,955 more rows
# tidy_plays %>%
# filter(title == "Othello") %>%
# inner_join(nrc_joy) %>%
# count(word, sort = TRUE)
bing_word_counts %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
coord_flip()
Overall, negative words such as death, fear and poor occur less often in these plays than positive words such as good, well, like and love.
We can follow the same process and filter on specific plays to see which words contribute to a positive or negative emotion.
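Filtering on one play before joining the lexicon can be sketched as follows. The toy_plays tibble is invented and stands in for tidy_plays; with the real data only the filter() line would need to change.

```r
library(dplyr)
library(tidytext)

# Invented tokens standing in for tidy_plays
toy_plays <- tibble(
  title = c("Othello", "Othello", "Othello", "The Tempest", "The Tempest"),
  word  = c("love", "death", "death", "sweet", "fear")
)

othello_counts <- toy_plays %>%
  filter(title == "Othello") %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE)

othello_counts
```

In this toy input, death is counted twice as negative and love once as positive for Othello.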
The aim of this vignette was simply to illustrate the ease with which one can explore texts with the tidytext package in combination with other tidy tools.
The words quantified and analysed are just from 10 plays of Shakespeare based on their genre. The results obtained certainly reveal an interesting aspect of the bard’s plays.
Silge, Julia, and David Robinson. 2018. Text Mining with R. https://www.tidytextmining.com/index.html.
Wiedemann, Gregor. 2016. Text Mining for Qualitative Data Analysis in the Social Sciences: A Study on Democratic Discourse in Germany. Wiesbaden, Germany: Vieweg. http://ebookcentral.proquest.com/lib/uts/detail.action?docID=4653480.