In this markdown we will explore text mining. We will be using the “sherlock” data package, which provides access to 7 Sherlock Holmes novels:

1. A Scandal in Bohemia
2. A Study in Scarlet
3. The Adventure of the Red Circle
4. The Adventure of Wisteria Lodge
5. The Hound of the Baskervilles
6. The Sign of the Four
7. The Valley Of Fear
We will be using Text Mining with R: A Tidy Approach by Julia Silge & David Robinson as a guide to quantify sentiment. Given that the story of Sherlock Holmes revolves around detective work, do we expect the novels to have a negative sentiment?
We will be using the following libraries to explore and mine sentiment from all 7 books: dplyr, stringr, tidyr, tidytext, ggplot2, lexicon, wordcloud, and reshape2.
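The library() calls that produce the loading messages below would look roughly like this (a minimal sketch; the package set simply mirrors the list above):

#Load the packages used throughout this analysis
library(tidyr)      #pivot_wider
library(dplyr)      #data manipulation verbs
library(stringr)    #str_detect and regex helpers
library(tidytext)   #unnest_tokens, get_sentiments
library(ggplot2)    #plotting
library(lexicon)    #sentiment lexicons (hash_sentiment_senticnet)
library(wordcloud)  #wordcloud, comparison.cloud
library(reshape2)   #acast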
## Warning: package 'tidyr' was built under R version 4.1.3
## Warning: package 'dplyr' was built under R version 4.1.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Warning: package 'stringr' was built under R version 4.1.3
## Warning: package 'tidytext' was built under R version 4.1.3
## Warning: package 'ggplot2' was built under R version 4.1.3
## Warning: package 'lexicon' was built under R version 4.1.3
## Warning: package 'wordcloud' was built under R version 4.1.3
## Loading required package: RColorBrewer
## Warning: package 'RColorBrewer' was built under R version 4.1.3
## Warning: package 'reshape2' was built under R version 4.1.3
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
After loading the “sherlock” library, it is crucial to tidy the text as much as possible. Given that the package contains several books, the group_by() function will serve to separate them. Then, mutate() along with a regular expression is used to detect each chapter. A filter() is used to remove any rows where chapter is equal to 0 (zero); removing these rows filters out the headers and tables of contents in every book. Lastly, we tokenize our text data. Tokenization is the practice of splitting text into words or larger units, such as sentences or paragraphs. For our purposes, we will tokenize at the single-word level.
#devtools::install_github("EmilHvitfeldt/sherlock")
library(sherlock)

holm <- holmes %>%
  group_by(book) %>%
  mutate(chapter = cumsum(str_detect(text,
    regex("^chapter [\\divxlc]", ignore_case = TRUE)))) |>
  filter(chapter != 0) |> #remove intro from all books
  ungroup()

tidy_books <- holmes %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text,
           regex("^chapter [\\divxlc]", ignore_case = TRUE)))) |>
  filter(chapter != 0) %>%
  ungroup() %>%
  unnest_tokens(word, text)
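As a quick sanity check on the chapter-detection pattern, the regex only matches lines that begin with the word “chapter” followed by an Arabic or Roman numeral, regardless of case. The example strings below are made up purely for illustration.

#Hypothetical lines to show what the chapter regex does and does not match
str_detect(c("CHAPTER I", "Chapter 2", "He closed that chapter of his life"),
           regex("^chapter [\\divxlc]", ignore_case = TRUE))
#returns TRUE TRUE FALSE: only lines starting with "chapter" plus a numeral match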
Next we will initialize the three lexicons introduced in the textbook, along with a lexicon of our choosing. For our choice, we will be using “sentiword” (the lexicon package’s hash_sentiment_senticnet table). Aside from a dichotomous positive/negative sentiment, which we derive from the sign of each word’s score, this lexicon includes a weighted score for each word.
afinn <- get_sentiments("afinn")
bing <- get_sentiments("bing")
nrc <- get_sentiments("nrc")
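Note that these lexicons differ in structure: AFINN attaches an integer score (roughly -5 to 5) to each word, while Bing and NRC attach categorical labels. A quick way to confirm this, shown as a sketch (output omitted):

#Inspect each lexicon: AFINN has a numeric value, Bing a binary sentiment,
#and NRC has several sentiment/emotion categories
head(afinn)
head(bing)
nrc %>% count(sentiment)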
To extract joyful words we begin by filtering the nrc list to the “joy” sentiment. We then use an inner_join() to keep only the words that appear in both data frames. What’s left is to count the words.
#Get joyful words from nrc
nrc_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
#Get count of joyful words from Holmes series
tidy_books %>%
inner_join(nrc_joy) %>%
count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 378 x 2
## word n
## <chr> <int>
## 1 good 285
## 2 found 210
## 3 friend 154
## 4 hope 100
## 5 companion 88
## 6 treasure 77
## 7 true 55
## 8 save 53
## 9 remarkable 50
## 10 money 49
## # ... with 368 more rows
Another method that can be used to evaluate sentiment at a deeper level is chunking, or segmentation. Sometimes words on their own can be misleading (e.g., “do not like”, where the word “like” is negated by “not”). If we aggregate at a broader level, we may capture a more realistic sentiment. For our next example, we will segment the text of each book into chunks of 80 lines. We will count both the number of positive and the number of negative words and calculate the residual count (positive - negative). If the residual count is positive, we will label the chunk as positive in sentiment, and vice versa.
In this particular segment we use the “bing” lexicon to label sentiment. Once labeled, we plot the residuals to evaluate the overall sentiment of each book. Judging by Bing, most of the Sherlock Holmes novels appear to lean toward a negative sentiment.
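The chunk index comes from integer division of the line number by 80, so lines 1-79 fall into chunk 0, lines 80-159 into chunk 1, and so on. A quick illustration:

#Integer division assigns each line number to an 80-line chunk
c(1, 79, 80, 159, 160) %/% 80
#returns 0 0 1 1 2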
#Establishing sentiment to each chunk
holmes_sentiment <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(book, index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining, by = "word"
#Plot sentiment for each chunk
ggplot(holmes_sentiment, aes(index, sentiment, fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, ncol = 2, scales = "free_x")
# Other Lexicons

What about the other lexicons?
In this portion we will evaluate the differences among the three lexicons introduced by Julia and David. We will look at one book, A Study in Scarlet, to visualize how the three lexicons differ.
#Extract one book
asib <- tidy_books %>%
filter(book == "A Study In Scarlet")
#Get sentiment for AFINN
afinn <- asib %>%
inner_join(get_sentiments("afinn")) %>%
group_by(index = linenumber %/% 80) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN")
## Joining, by = "word"
#Get sentiment for bing and nrc
bing_and_nrc <- bind_rows(
  asib %>%
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  asib %>%
    inner_join(get_sentiments("nrc") %>%
                 filter(sentiment %in% c("positive", "negative"))) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
## Joining, by = "word"
#visualize all 3 lexicons
bind_rows(afinn,
bing_and_nrc) %>%
ggplot(aes(index, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y")
#Get count of positive vs negative words
get_sentiments("nrc") %>%
filter(sentiment %in% c("positive", "negative")) %>%
count(sentiment)
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 3316
## 2 positive 2308
get_sentiments("bing") %>%
count(sentiment)
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 4781
## 2 positive 2005
What if we want to know which words appear most often by sentiment? dplyr has you covered. Using inner_join() and count(), one can tally each word and bin it by a given category, in this case positive and negative. However, there are too many words, which saturates our plot. To fix this we’ll use slice_max() to keep only the top 10 most frequent words for each sentiment.
As mentioned by Julia and David, we too have “miss” as a frequent top word for negative sentiment. We will take the same approach and add it to our list of stop words. Like other stop words, the word “miss” does not add any value to our sentiment analysis.
bing_word_counts <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
## Joining, by = "word"
#visualize most common positive and negative words
bing_word_counts %>%
group_by(sentiment) %>%
slice_max(n, n = 10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(x = "Contribution to sentiment",
y = NULL)
#Add a custom stop word to list
custom_stop_words <- bind_rows(tibble(word = c("miss"),
lexicon = c("custom")),
stop_words)
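With the custom list in place, an anti_join() removes “miss” (along with the standard stop words) before recounting. A minimal sketch of how the list could be applied:

#Remove custom stop words (including "miss") before recounting sentiment words
tidy_books %>%
  anti_join(custom_stop_words, by = "word") %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE)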
Using a wordcloud gives the reader an easy-to-view summary of the top words, scaled by their counts: the bigger the count, the bigger the word. We’ll create a comparison cloud that divides the words by whether they carry positive or negative sentiment.
tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("gray20", "gray80"),
max.words = 100)
## Joining, by = "word"
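For comparison, a plain (undivided) cloud of the most frequent words overall can be built with wordcloud(). This is a minimal sketch that assumes the custom_stop_words list defined earlier:

#Plain wordcloud of the most common words, with stop words removed
tidy_books %>%
  anti_join(custom_stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))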
# A Fourth Lexicon

In this section we will explore a lexicon not mentioned in Text Mining with R. Given that our sentiword data gives us two values (a continuous weight, plus a categorical tone we derive from its sign), we will take two approaches to evaluate which one best fits our needs. For both approaches we will be looking at residuals.
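Before renaming, the raw table stores each word in column x and a continuous polarity weight in column y; a quick peek, shown here as a sketch (output omitted):

#Inspect the raw lexicon: x holds the word, y its polarity weight
head(hash_sentiment_senticnet)
range(hash_sentiment_senticnet$y) #weights fall roughly between -1 and 1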
sentiword <- hash_sentiment_senticnet %>%
  rename(word = x,
         sentiment_weight = y) |>
  mutate(sentiment_tone = ifelse(sentiment_weight > 0, "Positive", "Negative"))
#Get count of words for every 80-line section using sentiword for all Sherlock Holmes Books
holmes_sentiment_sentiword_cnt <- tidy_books %>%
inner_join(sentiword, by = "word") %>%
count(book, index = linenumber %/% 80, sentiment_tone) |>
pivot_wider(names_from = sentiment_tone, values_from = n, values_fill = 0) %>%
mutate(sentiment = Positive - Negative,
method = "sentiword cnt")
ggplot(holmes_sentiment_sentiword_cnt, aes(index, sentiment, fill = book)) +
geom_col(show.legend = FALSE) + labs(title = "Sentiword Positive vs Negative Sentiments by Counts") +
facet_wrap(~book, ncol = 2, scales = "free_x")
## Evaluation based on the sentiment weight of each word
#Get positive or negative sentiment weight for every 80-line section
holmes_sentiment_sentiword_weight <- tidy_books %>%
inner_join(sentiword, by = "word") %>%
group_by(book, index = linenumber %/% 80) |>
summarise(sentiment_score = sum(sentiment_weight))
## `summarise()` has grouped output by 'book'. You can override using the
## `.groups` argument.
ggplot(holmes_sentiment_sentiword_weight, aes(index, sentiment_score, fill = book)) +
geom_col(show.legend = FALSE) + labs(title = "Sentiword Positive vs Negative Sentiments by Word Weight") +
facet_wrap(~book, ncol = 2, scales = "free_x")
# Comparing Lexicons
bing_and_nrc <- bind_rows(
  tidy_books %>%
    filter(book == "The Valley Of Fear") |>
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  tidy_books %>%
    filter(book == "The Valley Of Fear") |>
    inner_join(get_sentiments("nrc") %>%
                 filter(sentiment %in% c("positive", "negative"))) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
## Joining, by = "word"
bind_rows(holmes_sentiment_sentiword_cnt,
bing_and_nrc) %>%
ggplot(aes(index, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y")
#count positive and negatives for sentiword
sum(holmes_sentiment_sentiword_cnt$Positive)
## [1] 48069
sum(holmes_sentiment_sentiword_cnt$Negative)
## [1] 23831
senti_word_counts <- tidy_books %>%
inner_join(sentiword, by = "word") %>%
count(word, sentiment_tone, sort = TRUE) %>%
ungroup()
senti_word_counts %>%
group_by(sentiment_tone) %>%
slice_max(n, n = 10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = sentiment_tone)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment_tone, scales = "free_y") +
labs(x = "Contribution to sentiment",
y = NULL)
## Wordcloud Sentiword
tidy_books %>%
inner_join(sentiword) %>%
count(word, sentiment_tone, sort = TRUE) %>%
acast(word ~ sentiment_tone, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("gray20", "gray80"),
max.words = 100)
## Joining, by = "word"
# Conclusion
By using tidytext and guidance from Text Mining with R, we were able to evaluate the sentiment of all 7 Sherlock Holmes novels. Using AFINN, Bing et al., and NRC we found that there is heavy variance in sentiment. When trying the lexicon of our choosing, we found that it was overly optimistic, with both the dichotomous attribute and the weighted sentiment score showing only positive sentiment. Given the outcome of all 4 lexicons, I would label the Sherlock Holmes novels slightly more positive than negative overall.
# References

- Emil Hvitfeldt (2022). sherlock: What the Package Does (One Line, Title Case). R package version 0.0.0.9000. https://github.com/EmilHvitfeldt/sherlock
- Julia Silge and David Robinson (2016). Text Mining with R: A Tidy Approach. https://www.tidytextmining.com/index.html