First, I will get the example from Chapter 2 running.
Reproducing the Base Example
Taken from:
Silge, J., & Robinson, D. (2017). “Sentiment Analysis with Tidy Data.” Text Mining with R. Retrieved from https://www.tidytextmining.com/sentiment.html.
Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 435434 of `x` matches multiple rows in `y`.
ℹ Row 5051 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
# A tibble: 122,204 × 4
   book              linenumber chapter word
   <fct>                  <int>   <int> <chr>
 1 Pride & Prejudice          1       0 pride
 2 Pride & Prejudice          1       0 and
 3 Pride & Prejudice          1       0 prejudice
 4 Pride & Prejudice          3       0 by
 5 Pride & Prejudice          3       0 jane
 6 Pride & Prejudice          3       0 austen
 7 Pride & Prejudice          7       1 chapter
 8 Pride & Prejudice          7       1 1
 9 Pride & Prejudice         10       1 it
10 Pride & Prejudice         10       1 is
# ℹ 122,194 more rows
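The many-to-many warning that `inner_join()` prints throughout this output means that at least one word in the text matches more than one row of the lexicon, while lexicon words also match many rows of the text. A minimal sketch of silencing it by declaring the relationship explicitly follows. It uses a small made-up lexicon rather than the real Bing data, and assumes dplyr 1.1.0 or later, where the `relationship` argument is available.

```r
library(dplyr)

# Hypothetical toy data: "envious" carries two labels in this made-up lexicon.
words <- tibble(word = c("envious", "happy", "envious"))
toy_lexicon <- tibble(
  word      = c("envious", "envious", "happy"),
  sentiment = c("positive", "negative", "positive")
)

# Declaring the relationship tells dplyr that duplicate matches are expected,
# so no warning is raised.
joined <- words %>%
  inner_join(toy_lexicon, by = "word", relationship = "many-to-many")

joined
```

Whether to silence the warning or to deduplicate the lexicon first is a judgment call; silencing keeps every label a word carries.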
Joining with `by = join_by(word)`
Joining with `by = join_by(word)`
Warning in inner_join(., get_sentiments("nrc") %>% filter(sentiment %in% : Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 215 of `x` matches multiple rows in `y`.
ℹ Row 5178 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 435434 of `x` matches multiple rows in `y`.
ℹ Row 5051 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment", y = NULL)
# A tibble: 1,150 × 2
   word        lexicon
   <chr>       <chr>
 1 miss        custom
 2 a           SMART
 3 a's         SMART
 4 able        SMART
 5 about       SMART
 6 above       SMART
 7 according   SMART
 8 accordingly SMART
 9 across      SMART
10 actually    SMART
# ℹ 1,140 more rows
library(wordcloud)
Loading required package: RColorBrewer
tidy_books %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
Joining with `by = join_by(word)`
Warning in wordcloud(word, n, max.words = 100): miss could not be fit on page.
It will not be plotted.
library(reshape2)
Attaching package: 'reshape2'
The following object is masked from 'package:tidyr':
smiths
Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 435434 of `x` matches multiple rows in `y`.
ℹ Row 5051 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
Extend the Analysis
For my extended analysis, I chose to add the Loughran-McDonald lexicon and to use The Adventures of Sherlock Holmes, available through the gutenbergr package, as my new text corpus.
First, I grabbed the Loughran-McDonald lexicon from the textdata package.
Next, I downloaded The Adventures of Sherlock Holmes from gutenbergr using the ID value 1661 and tidied the data as in the base example, using mutate and unnest_tokens.
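The tidying step can be sketched as below. This is a minimal illustration, not the exact code used here: the helper name `tidy_text`, the toy input lines, and the chapter pattern (Roman numeral headings such as "I. ") are all assumptions, and the real download call is left commented out because it needs network access.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical helper: number the lines, count chapter headings cumulatively,
# then split the text into one word per row.
tidy_text <- function(raw, chapter_pattern) {
  raw %>%
    mutate(
      linenumber = row_number(),
      chapter = cumsum(str_detect(text, chapter_pattern))
    ) %>%
    unnest_tokens(word, text)
}

# Toy stand-in for the downloaded text (made-up lines, not from the book).
raw_demo <- tibble(text = c(
  "I. A FIRST STORY", "It was a quiet morning.",
  "II. A SECOND STORY", "The visitor arrived late."
))
demo_tidy <- tidy_text(raw_demo, regex("^[IVX]+\\. "))

# With network access, the real text could be fetched instead (assumed usage):
# sherlock_tidy <- gutenbergr::gutenberg_download(1661) %>%
#   tidy_text(regex("^[IVX]+\\. "))
```

The chapter pattern is the part most likely to need adjusting, since it depends on how headings appear in the downloaded file.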
Joining with `by = join_by(word)`
Joining with `by = join_by(word)`
Warning in inner_join(., get_sentiments("nrc") %>% filter(sentiment %in% : Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1790 of `x` matches multiple rows in `y`.
ℹ Row 108 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
Joining with `by = join_by(word)`
Here I compared the four sentiment lexicons.
bind_rows(afinn, compare_lexicons) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y") +
  theme_minimal() +
  labs(
    title = "Comparing Sentiment Lexicons: Sherlock Holmes",
    subtitle = "AFINN vs. Bing vs. Loughran vs. NRC",
    y = "Net Sentiment Score",
    x = "Narrative Progress (80-line chunks)"
  )
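The code that builds `afinn` and `compare_lexicons` is not shown in this output. As a rough sketch of how a net sentiment score per 80-line chunk might be computed for a lexicon with positive and negative labels, consider the following. The helper name `net_sentiment` and the toy data are hypothetical. AFINN would need different handling, since it assigns numeric scores that are summed rather than counted.

```r
library(dplyr)

# Hypothetical helper: positive minus negative word counts per chunk of lines.
net_sentiment <- function(tidy_df, lexicon, method, chunk = 80) {
  tidy_df %>%
    inner_join(lexicon, by = "word", relationship = "many-to-many") %>%
    filter(sentiment %in% c("positive", "negative")) %>%
    group_by(index = linenumber %/% chunk) %>%
    summarise(
      sentiment = sum(sentiment == "positive") - sum(sentiment == "negative"),
      .groups = "drop"
    ) %>%
    mutate(method = method)
}

# Made-up words and lexicon, only to show the shape of the result.
toy_words <- tibble(
  linenumber = c(1, 2, 85, 90),
  word = c("good", "bad", "happy", "happy")
)
toy_lexicon <- tibble(
  word = c("good", "bad", "happy"),
  sentiment = c("positive", "negative", "positive")
)
toy_scores <- net_sentiment(toy_words, toy_lexicon, "Toy")
toy_scores
```

Results from several lexicons computed this way share the columns `index`, `sentiment`, and `method`, so they can be stacked with `bind_rows()` before plotting.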
Here I counted the words matched by the Loughran lexicon. Its sentiment categories differ from those of the NRC and Bing lexicons.
Warning in inner_join(., get_sentiments("loughran")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 240 of `x` matches multiple rows in `y`.
ℹ Row 2679 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
Then I plotted the results of the loughran word counts.
loughran_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free") +
  labs(x = "Contribution to sentiment", y = NULL)
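The step that builds `loughran_word_counts` is hidden in this output. A minimal sketch of the likely shape of that step is given here, using a made-up mini-lexicon with Loughran-style category names in place of the real data (which requires a download through textdata). The word-to-category assignments are illustrative only.

```r
library(dplyr)

# Hypothetical mini-lexicon; the category names follow the Loughran lexicon,
# but these word assignments are invented for illustration.
toy_loughran <- tibble(
  word = c("loss", "doubt", "court", "gain"),
  sentiment = c("negative", "uncertainty", "litigious", "positive")
)
toy_words <- tibble(word = c("loss", "doubt", "loss", "court", "gain", "loss"))

# How many lexicon entries fall in each category.
category_counts <- count(toy_loughran, sentiment)

# How often each matched word occurs in the text, most frequent first.
toy_word_counts <- toy_words %>%
  inner_join(toy_loughran, by = "word") %>%
  count(word, sentiment, sort = TRUE)

toy_word_counts
```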
Here I added the word miss to stop_words because it comes up often.
# A tibble: 1,150 × 2
   word        lexicon
   <chr>       <chr>
 1 miss        custom
 2 a           SMART
 3 a's         SMART
 4 able        SMART
 5 about       SMART
 6 above       SMART
 7 according   SMART
 8 accordingly SMART
 9 across      SMART
10 actually    SMART
# ℹ 1,140 more rows
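A table like the one above can be produced by prepending one custom row to the `stop_words` data set that ships with tidytext. The sketch below follows the approach in the cited chapter; it is assumed, not confirmed, that the hidden code here does the same.

```r
library(dplyr)
library(tidytext)

# Put the custom entry first, followed by the built-in stop word list.
custom_stop_words <- bind_rows(
  tibble(word = "miss", lexicon = "custom"),
  stop_words
)

custom_stop_words
```

Passing `custom_stop_words` to `anti_join()` then removes "miss" along with the standard stop words.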
Here I created a word cloud for sherlock_tidy.
library(wordcloud)
sherlock_tidy %>%
  anti_join(custom_stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
Joining with `by = join_by(word)`
Then I created a comparison cloud for sherlock_tidy using the bing sentiments.
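The comparison cloud code is not echoed in this output. A minimal sketch of the usual pattern is shown here: reshape the word counts into a word-by-sentiment matrix with `acast()` from reshape2, then pass it to `comparison.cloud()`. The toy counts and the colour choices are assumptions for illustration.

```r
library(dplyr)
library(reshape2)
library(wordcloud)

# Made-up counts standing in for the result of joining sherlock_tidy with Bing.
toy_counts <- tibble(
  word = c("good", "happy", "bad", "fear"),
  sentiment = c("positive", "positive", "negative", "negative"),
  n = c(5, 3, 4, 2)
)

# One row per word, one column per sentiment, zero where a word has no count.
term_matrix <- acast(toy_counts, word ~ sentiment, value.var = "n", fill = 0)

# Words are sized by count and grouped by the sentiment they contribute to.
comparison.cloud(term_matrix, colors = c("gray20", "gray80"), max.words = 100)
```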
Since the Loughran lexicon was built mostly for financial text, it was interesting to apply it to an older work of fiction like The Adventures of Sherlock Holmes. The sentiment of The Adventures of Sherlock Holmes differs from that of the Jane Austen books in that it is more negative, with bigger spikes. This is apparent in the plots and is likely because the Sherlock Holmes stories deal with crime and mystery. The Loughran results were also more negative than those of the other three lexicons, likely because the lexicon was designed for financial text.