Tidy data principles can be applied to natural language processing. When text is organized in a format with one token per row, tasks like removing stop words or calculating word frequencies are natural applications of familiar operations within the tidy tool ecosystem. Sentiment analysis provides a way to understand the attitudes and opinions expressed in texts. This assignment applies sentiment analysis using tidy data principles: when text data is in a tidy data structure, sentiment analysis can be implemented as an inner join.
There are a variety of methods and dictionaries that exist for evaluating the opinion or emotion in text. The tidytext package provides access to several sentiment lexicons. Three general-purpose lexicons are AFINN (from Finn Årup Nielsen), bing (from Bing Liu and collaborators), and nrc (from Saif Mohammad and Peter Turney). All three are based on single words: AFINN scores words from -5 to 5, bing labels words as positive or negative, and nrc tags words with categories such as positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.
library(tidyverse)   # loads dplyr, tidyr, ggplot2, stringr, etc.
library(tidytext)
library(textdata)
library(wordcloud)
library(janeaustenr)
library(reshape2)
library(ggwordcloud)
library(gutenbergr)
library(DT)
The function get_sentiments() allows us to get specific sentiment lexicons with the appropriate measures for each one.

### Get Sentiments for AFINN
sentiments.afinn <- get_sentiments("afinn")
datatable(sentiments.afinn, filter = 'bottom', options = list(pageLength = 10))

### Get Sentiments for Bing

sentiments.bing <- get_sentiments("bing")
datatable(sentiments.bing, filter = 'bottom', options = list(pageLength = 10))

### Get Sentiments for NRC

sentiments.nrc <- get_sentiments("nrc")
datatable(sentiments.nrc, filter = 'bottom', options = list(pageLength = 10))

Next, let's tidy the text of Jane Austen's novels so that it has one word per row.

tidy_books <- austen_books() %>%
group_by(book) %>%
mutate(
linenumber = row_number(),
chapter = cumsum(str_detect(text,
regex("^chapter [\\divxlc]",
ignore_case = TRUE)))) %>%
ungroup() %>%
unnest_tokens(word, text)

Here cumsum() over the logical str_detect() results increments a running chapter counter each time a chapter heading is matched. Now that the text is in a tidy format with one word per row, we are ready to do the sentiment analysis. First, let's use the NRC lexicon and filter() for the joy words. Next, let's filter() the data frame with the text from the books for the words from Emma and then use inner_join() to perform the sentiment analysis.
nrc_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
tidy_books %>%
filter(book == "Emma") %>%
inner_join(nrc_joy) %>%
count(word, sort = TRUE)
## # A tibble: 303 x 2
## word n
## <chr> <int>
## 1 good 359
## 2 young 192
## 3 friend 166
## 4 hope 143
## 5 happy 125
## 6 love 117
## 7 deal 92
## 8 found 92
## 9 present 89
## 10 kind 82
## # ... with 293 more rows
Next, we can examine how sentiment changes over the course of each novel. Small sections of text may not have enough words in them to get a good estimate of sentiment while really large sections can wash out narrative structure. For these books, using 80 lines works well, but this can vary depending on individual texts, how long the lines were to start with, etc. We count positive and negative words in sections of 80 lines each, then use pivot_wider() so that we have negative and positive sentiment in separate columns, and lastly compute the net sentiment (positive minus negative).
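The section index in the code below comes from R's integer division operator %/%; for example:

79 %/% 80   # 0: the first 80 lines map to section index 0
101 %/% 80  # 1: lines 80-159 all map to section index 1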
jane_austen_sentiment <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(book, index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative)
# plot these sentiment scores
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, ncol = 2, scales = "free_x")

The plot shows how each novel changes toward more positive or negative sentiment over the trajectory of the story. Next, let's compare the three lexicons on a single novel; first we filter the tidy data for Pride & Prejudice.
pride_prejudice <- tidy_books %>%
filter(book == "Pride & Prejudice")
pride_prejudice
## # A tibble: 122,204 x 4
## book linenumber chapter word
## <fct> <int> <int> <chr>
## 1 Pride & Prejudice 1 0 pride
## 2 Pride & Prejudice 1 0 and
## 3 Pride & Prejudice 1 0 prejudice
## 4 Pride & Prejudice 3 0 by
## 5 Pride & Prejudice 3 0 jane
## 6 Pride & Prejudice 3 0 austen
## 7 Pride & Prejudice 7 1 chapter
## 8 Pride & Prejudice 7 1 1
## 9 Pride & Prejudice 10 1 it
## 10 Pride & Prejudice 10 1 is
## # ... with 122,194 more rows
Let's find the net sentiment in each of these sections of the novel with each of the three lexicons: for AFINN we sum the numeric scores, and for Bing and NRC we count positive and negative words.
afinn <- pride_prejudice %>%
inner_join(get_sentiments("afinn")) %>%
group_by(index = linenumber %/% 80) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN")
bing_and_nrc <- bind_rows(
pride_prejudice %>%
inner_join(get_sentiments("bing")) %>%
mutate(method = "Bing et al."),
pride_prejudice %>%
inner_join(get_sentiments("nrc") %>%
filter(sentiment %in% c("positive",
"negative"))
) %>%
mutate(method = "NRC")) %>%
count(method, index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
mutate(sentiment = positive - negative)

Let's bind them together and visualize them.
bind_rows(afinn,
bing_and_nrc) %>%
ggplot(aes(index, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y")

The three lexicons track similar dips and peaks through the novel, but differ in absolute terms. Let's look briefly at how many positive and negative words are in these lexicons.
# afinn
sentiments.afinn.negative <- sentiments.afinn %>% filter(value < 0)
count(sentiments.afinn.negative)
## # A tibble: 1 x 1
## n
## <int>
## 1 1598
sentiments.afinn.neutral <- sentiments.afinn %>% filter(value == 0)
count(sentiments.afinn.neutral)
## # A tibble: 1 x 1
## n
## <int>
## 1 1
sentiments.afinn.positive <- sentiments.afinn %>% filter(value > 0)
count(sentiments.afinn.positive)
## # A tibble: 1 x 1
## n
## <int>
## 1 878
# bing
get_sentiments("bing") %>%
count(sentiment)
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 4781
## 2 positive 2005
# nrc
get_sentiments("nrc") %>%
filter(sentiment %in% c("positive", "negative")) %>%
count(sentiment)
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 3324
## 2 positive 2312
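Both lexicons contain more negative than positive words. As a quick check on that balance (a small sketch reusing the counts above), we can compute each lexicon's negative-to-positive ratio:

# Negative-to-positive word ratio in the Bing and NRC lexicons
bind_rows(
  get_sentiments("bing") %>% count(sentiment) %>% mutate(lexicon = "bing"),
  get_sentiments("nrc") %>%
    filter(sentiment %in% c("positive", "negative")) %>%
    count(sentiment) %>%
    mutate(lexicon = "nrc")) %>%
  pivot_wider(names_from = sentiment, values_from = n) %>%
  mutate(neg_pos_ratio = negative / positive)

The ratio is noticeably higher for Bing than for NRC, which contributes to the systematic differences between the methods in the plot above.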
### Most Common Positive and Negative Words
bing_word_counts <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
bing_word_counts %>%
group_by(sentiment) %>%
slice_max(n, n = 10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(x = "Contribution to sentiment",
y = NULL)

This lets us spot an anomaly in the sentiment analysis: the word "miss" is coded as negative, but it is used as a title for young, unmarried women in Jane Austen's works. If it were appropriate for our purposes, we could easily add "miss" to a custom stop-words list using bind_rows(). We could implement that with a strategy such as this:
custom_stop_words <- bind_rows(tibble(word = c("miss"),
lexicon = c("custom")),
stop_words)
count(custom_stop_words)
## # A tibble: 1 x 1
## n
## <int>
## 1 1150
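To actually apply the custom list, we would anti_join() it before the sentiment join; a minimal sketch:

# Drop custom stop words (including "miss") before tagging sentiment
tidy_books %>%
  anti_join(custom_stop_words, by = "word") %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE)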
### Word Cloud
tidy_books %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))

Let's do the sentiment analysis to tag positive and negative words using an inner join, then find the most common positive and negative words. Because comparison.cloud() expects a matrix, we reshape the counts with reshape2::acast() first.
tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("gray20", "gray80"),
max.words = 100)

We can also tokenize text into units other than words, such as sentences; prideprejudice is a character vector from the janeaustenr package.

p_and_p_sentences <- tibble(text = prideprejudice) %>%
unnest_tokens(sentence, text, token = "sentences")
p_and_p_sentences$sentence[2]
## [1] "by jane austen"
The sentence tokenizing does seem to have a bit of trouble with UTF-8 encoded text; a possible workaround is sketched below.
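One option (a sketch, not part of the original analysis; iconv() is base R, and "latin1" is just an example target encoding) is to re-encode the text before unnesting:

# Hypothetical fix: re-encode before sentence tokenization
tibble(text = prideprejudice) %>%
  mutate(text = iconv(text, to = "latin1")) %>%
  unnest_tokens(sentence, text, token = "sentences")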
Let's try to split the text of Jane Austen's novels into a data frame by chapter.

austen_chapters <- austen_books() %>%
group_by(book) %>%
unnest_tokens(chapter, text, token = "regex",
pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
ungroup()
austen_chapters %>%
group_by(book) %>%
summarise(chapters = n())
## # A tibble: 6 x 2
## book chapters
## <fct> <int>
## 1 Sense & Sensibility 51
## 2 Pride & Prejudice 62
## 3 Mansfield Park 49
## 4 Emma 56
## 5 Northanger Abbey 32
## 6 Persuasion 25
Let's find the number of negative words in each chapter and divide by the total words in each chapter, which identifies the most negative chapter in each book.
bingnegative <- get_sentiments("bing") %>%
filter(sentiment == "negative")
wordcounts <- tidy_books %>%
group_by(book, chapter) %>%
summarize(words = n())
tidy_books %>%
semi_join(bingnegative) %>%
group_by(book, chapter) %>%
summarize(negativewords = n()) %>%
left_join(wordcounts, by = c("book", "chapter")) %>%
mutate(ratio = negativewords/words) %>%
filter(chapter != 0) %>%
slice_max(ratio, n = 1) %>%
ungroup()
## # A tibble: 6 x 5
## book chapter negativewords words ratio
## <fct> <int> <int> <int> <dbl>
## 1 Sense & Sensibility 43 161 3405 0.0473
## 2 Pride & Prejudice 34 111 2104 0.0528
## 3 Mansfield Park 46 173 3685 0.0469
## 4 Emma 15 151 3340 0.0452
## 5 Northanger Abbey 21 149 2982 0.0500
## 6 Persuasion 4 62 1807 0.0343
This was an interesting dive into sentiment analysis, which provides a way to understand the attitudes and opinions expressed in texts. We explored how to approach sentiment analysis using tidy data principles: when text data is in a tidy data structure, sentiment analysis can be implemented as an inner join. We can use sentiment analysis to understand how a narrative arc changes throughout its course and which words with emotional and opinion content are important for a particular text.
As an extension, let's repeat the analysis on a Project Gutenberg download of Jane Austen's novels, adding the Loughran-McDonald lexicon restricted to its positive and negative categories.

loughran_sent <- get_sentiments("loughran") %>%
  filter(sentiment %in% c("positive", "negative"))

Next, we create the tidy Jane Austen data set.
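The code below relies on a jane_books data frame that is not constructed anywhere in this document. A plausible construction with gutenbergr is sketched here; the ID 105 for Persuasion matches the gutenberg_id in the output below, while 121 for Northanger Abbey is an assumption:

# Sketch (assumed IDs): download the novels from Project Gutenberg
jane_books <- gutenberg_download(c(105, 121), meta_fields = "title") %>%
  rename(book = title)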
tidydata <- jane_books[,c("text","book")] %>%
group_by(book) %>%
mutate(linenumber = row_number(),
chapter = cumsum(
str_detect(text, regex("^chapter [\\divxlc]", ignore_case = TRUE)))) %>%
ungroup() %>%
unnest_tokens(word, text)
head(jane_books)
## gutenberg_id text book
## 1 105 Persuasion Persuasion
## 2 105 Persuasion
## 3 105 Persuasion
## 4 105 by Persuasion
## 5 105 Persuasion
## 6 105 Jane Austen Persuasion
northanger_abbey <- tidydata %>%
  filter(book == "Northanger Abbey")
afinn2 <- northanger_abbey %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(index = linenumber %/% 80) %>%
  summarise(sentiment = sum(value)) %>%
  mutate(method = "AFINN")
bing_and_nrc2 <- bind_rows(
  northanger_abbey %>%
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing"),
  northanger_abbey %>%
    inner_join(get_sentiments("nrc") %>%
                 filter(sentiment %in% c("positive", "negative"))) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
loughran <- northanger_abbey %>%
  inner_join(loughran_sent) %>%
  mutate(method = "Loughran-McDonald") %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)

Let's bind all of them together, including the Loughran-McDonald results, and visualize them.

bind_rows(afinn2, bing_and_nrc2, loughran) %>%
ggplot(aes(index, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y")

A word cloud of the most common words in this data set, after removing stop words:

tidydata %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))

Comparison clouds of positive and negative words, first with the Loughran-McDonald lexicon and then with Bing:

tidydata %>%
inner_join(get_sentiments("loughran")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("gray20", "gray80"),
max.words = 100)

tidydata %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("gray20", "gray80"),
max.words = 100)

When using the Loughran-McDonald lexicon on Jane Austen's work, the 80-line sections show, on average, positive net sentiment, though a single score per section is a great simplification of the text. The Loughran-McDonald and Bing comparison clouds also show that the two lexicons tag noticeably different sets of sentiment words.