We will use sentiment analysis to systematically identify, extract, quantify, and study affective states and subjective information in text. We will do this on a corpus of novels, using several sentiment lexicons, as discussed in the sections below.
Loading the required libraries:
#install.packages("tidytext")
library(tidytext)
#install.packages("textdata")
library(textdata)
#install.packages("janeaustenr")
library(janeaustenr)
library(dplyr)
library(stringr)
library(knitr)
library(tidyr)
library(ggplot2)
library(wordcloud)
library(reshape2)
# Below: additional packages for our use case:
#devtools::install_github("bradleyboehmke/harrypotter")
library(harrypotter)
#devtools::install_github("mjockers/syuzhet")
library("syuzhet")The use case leverages the data provided in the harrypotter package. The package has been provided by bradleyboehmke.
The seven novels we are working with, all provided by the harrypotter package, are:

- philosophers_stone: Harry Potter and the Philosopher's Stone (1997)
- chamber_of_secrets: Harry Potter and the Chamber of Secrets (1998)
- prisoner_of_azkaban: Harry Potter and the Prisoner of Azkaban (1999)
- goblet_of_fire: Harry Potter and the Goblet of Fire (2000)
- order_of_the_phoenix: Harry Potter and the Order of the Phoenix (2003)
- half_blood_prince: Harry Potter and the Half-Blood Prince (2005)
- deathly_hallows: Harry Potter and the Deathly Hallows (2007)

Each text is in a character vector, with each element representing a single chapter.
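A quick way to confirm that structure is to inspect one of the vectors directly (a minimal check; the preview length of 70 characters is just illustrative):
# Each book is a character vector with one element per chapter:
length(philosophers_stone)           # number of chapters
substr(philosophers_stone[1], 1, 70) # first characters of chapter 1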
#To perform sentiment analysis we need to have our data in a tidy format:
#Vector of title names:
titles <- c("Philosopher's Stone", "Chamber of Secrets", "Prisoner of Azkaban",
"Goblet of Fire", "Order of the Phoenix", "Half-Blood Prince",
"Deathly Hallows")
#vector of books:
books <- list(philosophers_stone, chamber_of_secrets, prisoner_of_azkaban,
goblet_of_fire, order_of_the_phoenix, half_blood_prince,
deathly_hallows)
# Creating the tidy dataset series.full:
series.full <- tibble()
for(i in seq_along(titles)) {
clean <- tibble(chapter = seq_along(books[[i]]),
text = books[[i]]) %>%
unnest_tokens(word, text) %>%
mutate(book = titles[i]) %>%
select(book, everything())
series.full <- rbind(series.full, clean)
}
# set factor to keep books in order of publication:
series.full$book <- factor(series.full$book, levels = rev(titles))
#final tidy dataset ready for Analysis:
series.full
## # A tibble: 1,089,386 x 3
## book chapter word
## <fct> <int> <chr>
## 1 Philosopher's Stone 1 the
## 2 Philosopher's Stone 1 boy
## 3 Philosopher's Stone 1 who
## 4 Philosopher's Stone 1 lived
## 5 Philosopher's Stone 1 mr
## 6 Philosopher's Stone 1 and
## 7 Philosopher's Stone 1 mrs
## 8 Philosopher's Stone 1 dursley
## 9 Philosopher's Stone 1 of
## 10 Philosopher's Stone 1 number
## # ... with 1,089,376 more rows
# Getting the Loughran sentiment lexicon:
loughran.sentiments <- get_sentiments("loughran")
str(loughran.sentiments)
## Classes 'tbl_df', 'tbl' and 'data.frame': 4150 obs. of 2 variables:
## $ word : chr "abandon" "abandoned" "abandoning" "abandonment" ...
## $ sentiment: chr "negative" "negative" "negative" "negative" ...
We will first remove stop words from the book series dataset. This leaves a reduced, more focused set of words for our analysis:
#We will use the anti_join() function to remove all stop words from our series set:
series.main <- series.full %>%
anti_join(stop_words)
series.main
## # A tibble: 409,338 x 3
## book chapter word
## <fct> <int> <chr>
## 1 Philosopher's Stone 1 boy
## 2 Philosopher's Stone 1 lived
## 3 Philosopher's Stone 1 dursley
## 4 Philosopher's Stone 1 privet
## 5 Philosopher's Stone 1 drive
## 6 Philosopher's Stone 1 proud
## 7 Philosopher's Stone 1 perfectly
## 8 Philosopher's Stone 1 normal
## 9 Philosopher's Stone 1 people
## 10 Philosopher's Stone 1 expect
## # ... with 409,328 more rows
We can see the dataset has shrunk with the removal of stop words, from 1,089,386 to 409,338 rows.
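A quick sanity check of those row counts (assuming both tibbles are still in the session):
# Compare dataset sizes before and after stop-word removal:
nrow(series.full)
nrow(series.main)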
Checking for negative and positive sentiments in the first book, philosophers_stone:
# Creating a dataset for `negative` sentiment tokens:
loughran.sentiments.negative <- loughran.sentiments %>%
filter(sentiment == "negative")
# Creating a dataset for `positive` sentiment tokens:
loughran.sentiments.positive <- loughran.sentiments %>%
filter(sentiment == "positive")
# For negative tokens, we can use the inner_join() function to get all of the negative words from the book "Philosopher's Stone"; we will then count the frequency of each word and plot a wordcloud:
series.main %>%
filter(book == "Philosopher's Stone" ) %>%
inner_join(loughran.sentiments.negative) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
# We will repeat the above steps for positive tokens and plot them similarly on a wordcloud:
series.main %>%
filter(book == "Philosopher's Stone" ) %>%
inner_join(loughran.sentiments.positive) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
We can see there are more negative words than positive words in the first book.
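To put a number behind that impression, a minimal sketch (reusing the objects defined above) counts how many Loughran negative and positive tokens occur in the first book:
# Count negative vs. positive Loughran tokens in Philosopher's Stone:
series.main %>%
  filter(book == "Philosopher's Stone") %>%
  inner_join(loughran.sentiments) %>%
  filter(sentiment %in% c("negative", "positive")) %>%
  count(sentiment)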
Chapter-wise sentiment scores:
# We will take the positive and negative sentiments across all books in the series, grouped chapter-wise:
series.main.sentiment <- series.main %>%
inner_join(loughran.sentiments) %>%
count(book, index = chapter %/% 1, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
series.main.sentiment
## # A tibble: 200 x 9
## book index constraining litigious negative positive superfluous uncertainty
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Deat~ 1 4 2 47 20 0 10
## 2 Deat~ 2 1 11 85 26 0 16
## 3 Deat~ 3 0 2 45 10 0 7
## 4 Deat~ 4 2 1 59 7 0 9
## 5 Deat~ 5 1 4 85 5 0 16
## 6 Deat~ 6 6 3 92 25 0 19
## 7 Deat~ 7 2 5 75 33 0 18
## 8 Deat~ 8 1 4 63 30 0 15
## 9 Deat~ 9 1 3 51 4 0 4
## 10 Deat~ 10 8 2 75 28 0 11
## # ... with 190 more rows, and 1 more variable: sentiment <dbl>
# We can now plot it to visualize the spread of the sentiments:
ggplot(series.main.sentiment, aes(index, sentiment, fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, ncol = 2, scales = "free_x")From the ggplot and the table data above, we can see that the book overall is more negative then postive score for each chapter; the books are not for smaller children perhaps.
Top words across all sentiments in all books:
# Below runs across all the books in the series for all sentiment types and plots the top 15 words per sentiment category for the entire series:
series.main %>%
inner_join(loughran.sentiments) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup() %>%
group_by(sentiment) %>%
top_n(15) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
coord_flip()
We can conclude that Harry Potter had an adventurous as well as a thrilling life. The series rightly fits into the genres of fantasy, drama, young adult fiction, mystery, and thriller [wikipedia link].
The examples below follow the code provided in the book Text Mining with R; Chapter 2 of that book covers sentiment analysis.
The tidytext package provides access to several sentiment lexicons. Three general-purpose lexicons are
- AFINN from Finn Årup Nielsen,
- bing from Bing Liu and collaborators, and
- nrc from Saif Mohammad and Peter Turney.

All three of these lexicons are based on unigrams, i.e., single words. These lexicons contain many English words and the words are assigned scores for positive/negative sentiment, and also possibly emotions like joy, anger, sadness, and so forth.
The function get_sentiments() allows us to get specific sentiment lexicons with the appropriate measures for each one.
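The str() outputs below presumably come from loading each lexicon with get_sentiments(); a minimal sketch, using the object names (afinn.sentiments, bing.sentiments, nrc.sentiments) that the later code in this section relies on:
# Load the three general-purpose lexicons into the names used later:
afinn.sentiments <- get_sentiments("afinn")
bing.sentiments <- get_sentiments("bing")
nrc.sentiments <- get_sentiments("nrc")
str(afinn.sentiments)
str(bing.sentiments)
str(nrc.sentiments)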
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 2477 obs. of 2 variables:
## $ word : chr "abandon" "abandoned" "abandons" "abducted" ...
## $ value: num -2 -2 -2 -2 -2 -2 -3 -3 -3 -3 ...
## - attr(*, "spec")=
## .. cols(
## .. word = col_character(),
## .. value = col_double()
## .. )
## Classes 'tbl_df', 'tbl' and 'data.frame': 6786 obs. of 2 variables:
## $ word : chr "2-faces" "abnormal" "abolish" "abominable" ...
## $ sentiment: chr "negative" "negative" "negative" "negative" ...
## Classes 'tbl_df', 'tbl' and 'data.frame': 13901 obs. of 2 variables:
## $ word : chr "abacus" "abandon" "abandon" "abandon" ...
## $ sentiment: chr "trust" "fear" "negative" "sadness" ...
Joy score from the NRC lexicon:
Let’s look at the words with a joy score from the NRC lexicon and compare against the austen_books corpus.
# austen_books pull:
tidy_books <- austen_books() %>%
group_by(book) %>%
mutate(linenumber = row_number(),
chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
ignore_case = TRUE)))) %>%
ungroup() %>%
unnest_tokens(word, text)
kable(head(tidy_books, 10))
| book | linenumber | chapter | word |
|---|---|---|---|
| Sense & Sensibility | 1 | 0 | sense |
| Sense & Sensibility | 1 | 0 | and |
| Sense & Sensibility | 1 | 0 | sensibility |
| Sense & Sensibility | 3 | 0 | by |
| Sense & Sensibility | 3 | 0 | jane |
| Sense & Sensibility | 3 | 0 | austen |
| Sense & Sensibility | 5 | 0 | 1811 |
| Sense & Sensibility | 10 | 1 | chapter |
| Sense & Sensibility | 10 | 1 | 1 |
| Sense & Sensibility | 13 | 1 | the |
# nrc sentiments joy:
nrc.sentiments.joy <- nrc.sentiments %>%
filter(sentiment == 'joy')
tidy_books %>%
filter(book == "Emma") %>%
inner_join(nrc.sentiments.joy) %>%
count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 303 x 2
## word n
## <chr> <int>
## 1 good 359
## 2 young 192
## 3 friend 166
## 4 hope 143
## 5 happy 125
## 6 love 117
## 7 deal 92
## 8 found 92
## 9 present 89
## 10 kind 82
## # ... with 293 more rows
# nrc sentiments sadness:
nrc.sentiments.sadness <- nrc.sentiments %>%
filter(sentiment == 'sadness')
tidy_books %>%
filter(book == "Emma") %>%
inner_join(nrc.sentiments.sadness) %>%
count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 347 x 2
## word n
## <chr> <int>
## 1 doubt 98
## 2 ill 72
## 3 bad 60
## 4 leave 58
## 5 mother 57
## 6 feeling 56
## 7 impossible 41
## 8 pain 34
## 9 evil 33
## 10 wanting 33
## # ... with 337 more rows
# jane_austen_sentiment:
jane_austen_sentiment <- tidy_books %>%
inner_join(bing.sentiments) %>%
count(book, index = linenumber %/% 80, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining, by = "word"
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, ncol = 2, scales = "free_x")Let’s use all three sentiment lexicons and examine how the sentiment changes across the narrative arc of Pride and Prejudice. First, let’s use filter() to choose only the words from the one novel we are interested in.
# pride_prejudice:
pride_prejudice <- tidy_books %>%
filter(book == "Pride & Prejudice")
kable(head(pride_prejudice, 10))
| book | linenumber | chapter | word |
|---|---|---|---|
| Pride & Prejudice | 1 | 0 | pride |
| Pride & Prejudice | 1 | 0 | and |
| Pride & Prejudice | 1 | 0 | prejudice |
| Pride & Prejudice | 3 | 0 | by |
| Pride & Prejudice | 3 | 0 | jane |
| Pride & Prejudice | 3 | 0 | austen |
| Pride & Prejudice | 7 | 1 | chapter |
| Pride & Prejudice | 7 | 1 | 1 |
| Pride & Prejudice | 10 | 1 | it |
| Pride & Prejudice | 10 | 1 | is |
# Get the split counts:
afinn <- pride_prejudice %>%
inner_join(afinn.sentiments) %>%
group_by(index = linenumber %/% 80) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN")## Joining, by = "word"
bing_and_nrc <- bind_rows(pride_prejudice %>%
inner_join(bing.sentiments) %>%
mutate(method = "Bing et al."),
pride_prejudice %>%
inner_join(nrc.sentiments %>%
filter(sentiment %in% c("positive",
"negative"))) %>%
mutate(method = "NRC")) %>%
count(method, index = linenumber %/% 80, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining, by = "word"
## Joining, by = "word"
# Plot:
bind_rows(afinn,
bing_and_nrc) %>%
ggplot(aes(index, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y")# Positive and negative words are in these lexicons:
nrc.sentiments %>%
filter(sentiment %in% c("positive",
"negative")) %>%
count(sentiment)
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 3324
## 2 positive 2312
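The second table below presumably comes from the same count applied to the Bing lexicon; a minimal sketch, assuming the bing.sentiments object loaded earlier:
# Count positive and negative entries in the Bing lexicon:
bing.sentiments %>%
  count(sentiment)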
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 4781
## 2 positive 2005
We can analyze word counts that contribute to each sentiment. By calling count() with both word and sentiment as arguments, we find out how much each word contributed to each sentiment.
# bing_word_counts:
bing_word_counts <- tidy_books %>%
inner_join(bing.sentiments) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
## Joining, by = "word"
## # A tibble: 2,585 x 3
## word sentiment n
## <chr> <chr> <int>
## 1 miss negative 1855
## 2 well positive 1523
## 3 good positive 1380
## 4 great positive 981
## 5 like positive 725
## 6 better positive 639
## 7 enough positive 613
## 8 happy positive 534
## 9 love positive 495
## 10 pleasure positive 462
## # ... with 2,575 more rows
bing_word_counts %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
coord_flip()
## Selecting by n
# custom_stop_words:
custom_stop_words <- bind_rows(tibble(word = c("miss"),
lexicon = c("custom")),
stop_words)
custom_stop_words
## # A tibble: 1,150 x 2
## word lexicon
## <chr> <chr>
## 1 miss custom
## 2 a SMART
## 3 a's SMART
## 4 able SMART
## 5 about SMART
## 6 above SMART
## 7 according SMART
## 8 accordingly SMART
## 9 across SMART
## 10 actually SMART
## # ... with 1,140 more rows
Let’s look at the most common words in Jane Austen’s works as a whole again, but this time as a wordcloud. The size of a word’s text in the figure below is in proportion to its frequency within its sentiment. We can use this visualization to see the most important positive and negative words, but the sizes of the words are not comparable across sentiments.
tidy_books %>%
inner_join(bing.sentiments) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("gray20", "gray80"),
max.words = 100)
## Joining, by = "word"
Looking at units beyond just words: some sentiment analysis algorithms look beyond unigrams (i.e. single words) and try to understand the sentiment of a sentence as a whole.
# token = "sentences"; we may want to tokenize text into sentences, and it makes sense to use a new name for the output column in such a case :
PandP_sentences <- tibble(text = prideprejudice) %>%
unnest_tokens(sentence, text, token = "sentences")
PandP_sentences$sentence[2]
## [1] "however little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters."
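Since the syuzhet package was loaded at the start, one option for scoring whole sentences is its get_sentiment() function; a minimal sketch (the "syuzhet" method here is just one of several it offers):
# Score each sentence of Pride and Prejudice with the syuzhet lexicon:
sentence_scores <- get_sentiment(PandP_sentences$sentence, method = "syuzhet")
head(sentence_scores)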
# chapters: another option for unnest_tokens() is to split into tokens using a regex pattern. We could use this, for example, to split the text of Jane Austen’s novels into a data frame by chapter.
austen_chapters <- austen_books() %>%
group_by(book) %>%
unnest_tokens(chapter, text, token = "regex",
pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
ungroup()
austen_chapters %>%
group_by(book) %>%
summarise(chapters = n())
## # A tibble: 6 x 2
## book chapters
## <fct> <int>
## 1 Sense & Sensibility 51
## 2 Pride & Prejudice 62
## 3 Mansfield Park 49
## 4 Emma 56
## 5 Northanger Abbey 32
## 6 Persuasion 25
# bing.sentiments.negative: let’s find the number of negative words in each chapter and divide by the total words in each chapter, to see which chapter of each book has the highest proportion of negative words:
bing.sentiments.negative <- bing.sentiments %>%
filter(sentiment == "negative")
wordcounts <- tidy_books %>%
group_by(book, chapter) %>%
summarize(words = n())
tidy_books %>%
semi_join(bing.sentiments.negative) %>%
group_by(book, chapter) %>%
summarize(negativewords = n()) %>%
left_join(wordcounts, by = c("book", "chapter")) %>%
mutate(ratio = negativewords/words) %>%
filter(chapter != 0) %>%
top_n(1) %>%
ungroup()
## Joining, by = "word"
## Selecting by ratio
## # A tibble: 6 x 5
## book chapter negativewords words ratio
## <fct> <int> <int> <int> <dbl>
## 1 Sense & Sensibility 43 161 3405 0.0473
## 2 Pride & Prejudice 34 111 2104 0.0528
## 3 Mansfield Park 46 173 3685 0.0469
## 4 Emma 15 151 3340 0.0452
## 5 Northanger Abbey 21 149 2982 0.0500
## 6 Persuasion 4 62 1807 0.0343