First, Jane Austen’s novels are loaded from the janeaustenr package
and converted to the tidy format for sentiment analysis. Before the
text is tokenized, two new columns are added to record the line number
and chapter of each row, so every word keeps its position in the book.
The output column of the unnesting step is deliberately named “word”:
tidytext’s sentiment lexicons and its stop word (function word) list
also store their terms in a column named “word”, so the shared name
makes inner join and anti join operations straightforward.
library(janeaustenr)
library(dplyr)
library(stringr)
library(tidytext)

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text,
                                regex("^chapter [\\divxlc]",
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
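The chapter column relies on a running count of heading lines. The following is a minimal base-R sketch of that idea on made-up lines (the toy text is an assumption, and grepl() with perl = TRUE stands in for the str_detect() and regex() calls used above):

```r
# Toy input (hypothetical lines, not taken from the novels)
text <- c("A NOVEL", "CHAPTER I", "first line", "second line",
          "Chapter 2", "third line")

# TRUE on lines that look like a chapter heading; perl = TRUE lets \d
# work inside the bracket expression
is_heading <- grepl("^chapter [\\divxlc]", text,
                    ignore.case = TRUE, perl = TRUE)

# The running total of headings seen so far is the chapter number;
# lines before the first heading get chapter 0
chapter <- cumsum(is_heading)

data.frame(linenumber = seq_along(text), chapter, text)
```

Every line inherits the number of the most recent heading, which is why rows with chapter 0 correspond to front matter.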
Next, sentiment analysis is performed on Jane Austen’s Emma
using the NRC lexicon, which is available through tidytext’s
get_sentiments() function. More specifically, we want to find all the
words in Emma with a joyous connotation. To that end, we take the
subset of NRC that contains only the words categorized as expressing
the sentiment “joy”. An inner join between the words of Emma and the
words of this joy lexicon then returns the “joy” words in Emma.
nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")
tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## # A tibble: 301 × 2
## word n
## <chr> <int>
## 1 good 359
## 2 friend 166
## 3 hope 143
## 4 happy 125
## 5 love 117
## 6 deal 92
## 7 found 92
## 8 present 89
## 9 kind 82
## 10 happiness 76
## # … with 291 more rows
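To make the join-then-count step concrete, here is a small sketch in base R rather than dplyr. The token vector and the three-word “joy” list are invented for illustration, and %in% plus table() are stand-ins for inner_join() and count():

```r
# Hypothetical tokens and a hypothetical three-word "joy" list
words <- c("good", "friend", "the", "good", "hope", "of", "good")
joy_lexicon <- c("good", "hope", "love")

# Keep only the tokens that also appear in the lexicon, which is what
# the inner join achieves when each lexicon word occurs once
joy_words <- words[words %in% joy_lexicon]

# Count each remaining word, most frequent first
joy_counts <- sort(table(joy_words), decreasing = TRUE)
joy_counts
```

Tokens that are missing from the lexicon (“friend”, “the”, “of”) simply drop out, and lexicon words that never occur in the text (“love”) do not appear in the counts.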
In this section, the change in sentiment across each of Austen’s
novels is visualized. This requires a different lexicon, Bing, which
labels each word as either “positive” or “negative”. Using inner_join()
on the two tibbles (the join is made on word by default because it is
the only column they share) keeps every word that Bing labels as
positive or negative. The words are then grouped into sections of 80
consecutive lines, and the section number of each word is stored in the
index column. Because the index is the integer quotient of the line
number and 80, words in lines 1 to 79 get index 0, those in lines 80 to
159 get index 1, and so forth. Finally, a net sentiment score is
calculated per index by subtracting the negative count from the
positive count.
Each novel is split into sections of 80 lines because the section size controls how much detail survives. In sections that are too large, positive and negative passages tend to cancel each other out, which pushes the net score toward 0 and washes out the narrative structure. In sections that are too small, there are too few words for a stable estimate, so the scores fluctuate erratically and do not reflect the narrative flow of the novel.
library(tidyr)

jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
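The index comes from integer division, which can be checked on a few line numbers. This is only a sketch; the 200-line text used for the section sizes is an assumed example:

```r
# Integer division maps line numbers to 80-line sections
linenumber <- c(1, 79, 80, 159, 160)
index <- linenumber %/% 80
index

# Section sizes for a hypothetical 200-line text
section_sizes <- table((1:200) %/% 80)
section_sizes
```

Because line numbers start at 1 rather than 0, the first section holds 79 lines, and the last section holds whatever is left over.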
Now, we are ready to plot the data with index as the
explanatory variable and sentiment as the response
variable.
library(ggplot2)

ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")
The graphs seem to suggest that Persuasion is Austen’s most consistently positive novel. Interesting!
tidytext gives access to several sentiment lexicons through
get_sentiments(), including the three general-purpose ones used here:
NRC, Bing, and AFINN. To see how they differ from each other, all three
are used to track changes along the narrative arc of Jane Austen’s
Pride and Prejudice.
Using AFINN:
# Selecting only Pride & Prejudice
pride_prejudice <- tidy_books %>%
  filter(book == "Pride & Prejudice")
# Finding the sentiment value of each narrative section called index
afinn <- pride_prejudice %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(index = linenumber %/% 80) %>%
  summarise(sentiment = sum(value)) %>%
  mutate(method = "AFINN")
In the chunk above, the sentiment analysis is again performed on indices comprising 80 lines. However, because the AFINN lexicon scores words on a scale of -5 to 5 (with -5 being the most negative and 5 being the most positive), we do not count positive and negative words. Instead, we sum the sentiment values of all matched words within each index to get the net sentiment score for that index.
Using Bing and NRC:
bing_and_nrc <- bind_rows(
  pride_prejudice %>%
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  pride_prejudice %>%
    inner_join(get_sentiments("nrc") %>%
                 filter(sentiment %in% c("positive",
                                         "negative"))
    ) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>%
  mutate(sentiment = positive - negative)
In the case of Bing, the inner join is simple because all its words are already labeled as either “positive” or “negative”. NRC, however, also contains other categories (such as “joy”), so the lexicon is first filtered down to the “positive” and “negative” categories and then joined. As before, the analysis is done on indices consisting of 80 consecutive lines. Unlike with AFINN, the net sentiment of each index is the count of positive words minus the count of negative words. This is because Bing and NRC assign words to categories, while AFINN assigns them numeric scores.
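The two scoring rules can be compared side by side on a toy section. In this sketch the mini-lexicons, their words, and their scores are all invented for illustration and are not taken from AFINN, Bing, or NRC:

```r
# Hypothetical mini-lexicons; words and scores are illustrative only
afinn_like <- c(happy = 3, good = 2, bad = -3, awful = -3)
bing_like  <- c(happy = "positive", good = "positive",
                bad = "negative", awful = "negative")

# Tokens from one imaginary 80-line section
section <- c("happy", "good", "good", "awful")

# AFINN-style score: sum the numeric values of the matched words
afinn_score <- sum(afinn_like[section])

# Bing/NRC-style score: positive word count minus negative word count
labels <- bing_like[section]
bing_score <- sum(labels == "positive") - sum(labels == "negative")

c(afinn = afinn_score, bing = bing_score)
```

The same four tokens give different net scores under the two rules, which is why the facets are easier to compare by shape than by absolute height.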
Visualizing and comparing all the results:
bind_rows(afinn,
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")
As we can see, although all three lexicons show a similar trend in sentiment over the course of the novel, there are some differences. For instance, it is immediately clear that NRC has fewer negative indices than the other two. Why is this? Let’s see.
# Positive and negative words in NRC
get_sentiments("nrc") %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  count(sentiment)
## # A tibble: 2 × 2
## sentiment n
## <chr> <int>
## 1 negative 3316
## 2 positive 2308
# Positive and negative words in Bing
get_sentiments("bing") %>%
  count(sentiment)
## # A tibble: 2 × 2
## sentiment n
## <chr> <int>
## 1 negative 4781
## 2 positive 2005
So, one likely reason NRC has fewer negative indices is that the lexicon has a higher ratio of positive to negative words than Bing does.
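The ratios can be worked out directly from the word counts printed in the two outputs above. A short sketch:

```r
# Word counts copied from the two lexicon summaries above
nrc  <- c(negative = 3316, positive = 2308)
bing <- c(negative = 4781, positive = 2005)

# Positive words per negative word in each lexicon
ratio <- c(NRC  = nrc[["positive"]] / nrc[["negative"]],
           Bing = bing[["positive"]] / bing[["negative"]])
round(ratio, 2)
```

Both lexicons contain more negative than positive words, but the imbalance is noticeably stronger in Bing, so a given stretch of text has more chances to match a negative Bing entry.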
In this section, we look more closely at the individual words that
drive the sentiment, rather than at the overall change of sentiment
along the narrative arc. This is done across all the novels in
janeaustenr.
bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
head(bing_word_counts)
## # A tibble: 6 × 3
## word sentiment n
## <chr> <chr> <int>
## 1 miss negative 1855
## 2 well positive 1523
## 3 good positive 1380
## 4 great positive 981
## 5 like positive 725
## 6 better positive 639
Now, let’s visualize the data.
bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)
We can easily create a wordcloud for the novels in janeaustenr using
the wordcloud package. Before we do so, we need to filter
out the function words using the stop_words tibble from
tidytext.
library(wordcloud)

tidy_books %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
It looks like “time” is one of the most common words across Austen’s novels once stop words are removed.
Sentiment analysis at the word level relies on unigrams (single words) produced by the unnest_tokens() function. However, the same function can also produce larger text units, such as sentences. The following code does exactly this:
p_and_p_sentences <- tibble(text = prideprejudice) %>%
  unnest_tokens(sentence, text, token = "sentences")
p_and_p_sentences$sentence[2]
## [1] "by jane austen"
Alternatively, we may want to define the text unit ourselves. In that case, we can pass “regex” to the token parameter of unnest_tokens() and supply a regex pattern; the text is then split wherever the pattern matches. In the code below, the pattern matches chapter headings, so each token produced by the function is a chapter of the novel, plus one extra token per novel for the text that precedes the first heading.
austen_chapters <- austen_books() %>%
  group_by(book) %>%
  unnest_tokens(chapter, text, token = "regex",
                pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
  ungroup()
austen_chapters %>%
  group_by(book) %>%
  summarise(chapters = n())
## # A tibble: 6 × 2
## book chapters
## <fct> <int>
## 1 Sense & Sensibility 51
## 2 Pride & Prejudice 62
## 3 Mansfield Park 49
## 4 Emma 56
## 5 Northanger Abbey 32
## 6 Persuasion 25
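It can help to see what the splitting pattern matches on its own. The sketch below uses base R with perl = TRUE as a stand-in (unnest_tokens() uses its own regex engine, which may differ in details), and the sample strings are invented:

```r
pattern <- "Chapter|CHAPTER [\\dIVXLC]"

# Alternation binds loosest, so the pattern reads as:
#   "Chapter"   OR   "CHAPTER " followed by a digit or Roman numeral letter
x <- c("Chapter 1", "CHAPTER IV", "a new Chapter of life")
matched <- regmatches(x, regexpr(pattern, x, perl = TRUE))
matched

# Splitting a toy string on the pattern: two headings give three pieces,
# because the text before the first heading becomes a piece of its own
toy <- "front matter CHAPTER 1 first text CHAPTER 2 second text"
pieces <- strsplit(toy, pattern, perl = TRUE)[[1]]
length(pieces)
```

Note that the bare word “Chapter” matches anywhere, including in running prose, so the pattern depends on that capitalized word not appearing mid-sentence in the texts.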
As an example of the kind of analysis that can be done at the chapter level, let’s find, for each of Austen’s novels, the chapter with the highest proportion of negative words.
bingnegative <- get_sentiments("bing") %>%
  filter(sentiment == "negative")
wordcounts <- tidy_books %>%
  group_by(book, chapter) %>%
  summarize(words = n())
tidy_books %>%
  semi_join(bingnegative) %>%
  group_by(book, chapter) %>%
  summarize(negativewords = n()) %>%
  left_join(wordcounts, by = c("book", "chapter")) %>%
  mutate(ratio = negativewords/words) %>%
  filter(chapter != 0) %>%
  slice_max(ratio, n = 1) %>%
  ungroup()
## # A tibble: 6 × 5
## book chapter negativewords words ratio
## <fct> <int> <int> <int> <dbl>
## 1 Sense & Sensibility 43 161 3405 0.0473
## 2 Pride & Prejudice 34 111 2104 0.0528
## 3 Mansfield Park 46 173 3685 0.0469
## 4 Emma 15 151 3340 0.0452
## 5 Northanger Abbey 21 149 2982 0.0500
## 6 Persuasion 4 62 1807 0.0343
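The ratio in the last column is the number of negative words divided by the total number of words in the chapter. Here is a minimal base-R sketch of the same calculation on an invented two-chapter text, where the tokens and the negative word list are hypothetical:

```r
# Hypothetical tokens with their chapter numbers
tokens <- data.frame(
  chapter = c(1, 1, 1, 1, 2, 2, 2, 2),
  word = c("happy", "sad", "walk", "tea", "fear", "sad", "pain", "tea")
)
# Hypothetical negative word list
negative <- c("sad", "fear", "pain")

# Total words and negative words per chapter
words_per_chapter <- tapply(tokens$word, tokens$chapter, length)
negative_per_chapter <- tapply(tokens$word %in% negative,
                               tokens$chapter, sum)

# Share of negative words, and the chapter where it is highest
ratio <- negative_per_chapter / words_per_chapter
most_negative <- names(ratio)[which.max(ratio)]
most_negative
```

Dividing by the chapter length matters: without it, long chapters would look more negative simply because they contain more words.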
Before extending the code, I need to select a book to analyze.
library(gutenbergr)

# Importing The Hound of the Baskervilles text using the gutenbergr package
hound <- gutenberg_metadata %>%
  filter(title == 'The Hound of the Baskervilles', has_text == TRUE) %>%
  slice(1) %>%
  gutenberg_download()
# Adding chapter and line information to each row
hound <- hound %>%
  mutate(
    line = row_number(),
    chapter = cumsum(
      str_detect(text,
                 regex('^Chapter \\d{1,2}.$', ignore_case = TRUE))
    )
  ) %>%
  ungroup() %>%
  # Dropping the front matter that precedes the first chapter heading
  filter(chapter > 0)
It makes sense to begin the analysis with a visual of the common words in “The Hound of the Baskervilles” (excluding function words).
hound %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  ungroup() %>%
  with(wordcloud(word, n, max.words = 50))
“The Hound of the Baskervilles” is set in the Victorian era, a period well known for its social etiquette, so it makes sense that the most common word is “sir”. “moor” and “holmes” are also expected, since the main character is Sherlock Holmes and the story takes place on a moor. “henry”, “watson”, and “mortimer” are, of course, other characters in the story. The appearance of “baskerville” and “hound” is self-explanatory.
“The Hound of the Baskervilles” is a gothic mystery novel. The
setting comprises moors, swamps, dark nights, howls, and
will-o’-the-wisps. So, I expect the novel to be rich in descriptive
words, meaning adjectives. Let’s see if that’s the case using the
parts_of_speech lexicon from the package
tidytext. This lexicon categorizes words according to
their part of speech.
# All the adjectives in the novel
hound_adj <- hound %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  semi_join(
    parts_of_speech %>%
      filter(pos == 'Adjective')
  )
# All the nouns in the novel
hound_noun <- hound %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  semi_join(
    parts_of_speech %>%
      filter(pos == 'Noun')
  )
# The ratio between adjectives and nouns
adj_ratio <- nrow(hound_adj) / nrow(hound_noun)
According to the results of the analysis, there are 3974 adjective tokens in “The Hound of the Baskervilles” once stop words are removed. To give a relative measure of how high that number is, I computed the ratio of adjectives to nouns, since adjectives describe nouns. The ratio is 0.3449054, or roughly one adjective for every three nouns.
To see a visual representation of the most common adjectives:
hound_adj %>%
  count(word, name = 'number') %>%
  slice_max(order_by = number, n = 5) %>%
  ungroup() %>%
  ggplot(mapping = aes(x = reorder(word, -number), y = number, fill = word)) +
  geom_col(show.legend = FALSE) +
  labs(title = 'Adjectives of The Hound of the Baskervilles', x = 'word', y = NULL)
I suspect that the inclusion of “found” is inappropriate in the context of this novel. The lexicon lists a word under every part of speech it can take, so “found” is counted as an adjective even though, in a mystery novel, it most likely appears as a verb. Overall, seeing “light” (in reference to will-o’-the-wisps) and “black” among the most frequent adjectives reinforces my belief that “The Hound of the Baskervilles” has a very gothic setting.
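The effect of words that carry more than one part-of-speech tag can be reproduced on a toy table. In this sketch, the table rows and the tokens are hypothetical, and base-R %in% filtering stands in for the semi_join() calls used earlier:

```r
# Hypothetical slice of a part-of-speech table: one row per (word, pos) pair
pos_table <- data.frame(
  word = c("found", "found", "black", "black", "moor", "moor"),
  pos  = c("Verb (transitive)", "Adjective", "Adjective", "Noun",
           "Noun", "Verb (transitive)")
)

# Hypothetical tokens from the text
tokens <- c("found", "black", "moor", "found")

# A token is kept whenever its word has ANY row with the requested tag,
# regardless of how the word is used in the sentence
adjectives <- tokens[tokens %in% pos_table$word[pos_table$pos == "Adjective"]]
nouns      <- tokens[tokens %in% pos_table$word[pos_table$pos == "Noun"]]

list(adjectives = adjectives, nouns = nouns)
```

Here “found” is counted as an adjective both times it occurs, and “black” is counted as both an adjective and a noun, so the adjective and noun totals overlap. The adjective-to-noun ratio is therefore best read as a rough indicator rather than an exact measure.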
Let’s see if the nouns of the novel back up my assessment about the novel’s genre being gothic.
hound_noun %>%
count(word, sort = TRUE) %>%
ungroup() %>%
with(wordcloud(word, n, max.words = 50))
There are certainly some gothic words to be found in that wordcloud: moor, evening, death, night, dark, black, and light.
Finally, I want to get a feel for the narrative arc of the story by
using the lexicon hash_sentiment_jockers from the
package lexicon. The words in the lexicon are rated from -1
to 1. A more negative value indicates a more negative word and vice
versa.
library(lexicon)

hound %>%
  unnest_tokens(word, text) %>%
  inner_join(y = hash_sentiment_jockers, by = c('word' = 'x')) %>%
  group_by(chapter) %>%
  summarize(sentiment = sum(y)) %>%
  ungroup() %>%
  ggplot(mapping = aes(x = chapter, y = sentiment)) +
  geom_line(color = 'cyan') +
  scale_x_continuous(breaks = 1:15) +
  labs(title = 'Narrative arc of The Hound of the Baskervilles',
       x = 'chapter', y = 'sentiment_score')
Looking at the narrative arc, there appear to be three main low points within the story: chapters 9, 12, and 14. Considering the plot, these low points make sense. In chapter 9, there are two major scenes of conflict. In chapter 12, there are revelations about lies and affairs, and there is also a death. Meanwhile, chapter 14 is the climax of the story, with another attempted murder and the escape of the antagonist (though it’s assumed that Stapleton dies, there is no confirmation). So, it makes sense that those chapters of “The Hound of the Baskervilles” would have the most negative sentiment ratings. The fact that this novel’s narrative arc fluctuates up and down is also reasonable, considering it is a mystery novel with various murders and revelations sprinkled throughout the story.
[Text Mining with R | Chapter 2](https://www.tidytextmining.com/sentiment.html)