library(tidytext)
library(janeaustenr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
library(ggplot2)
library(gutenbergr)
library(wordcloud)
## Loading required package: RColorBrewer
In Text Mining with R, Chapter 2 looks at sentiment analysis. In this assignment, you should start by getting the primary example code from Chapter 2 working in an R Markdown document. You should provide a citation to this base code. You’re then asked to extend the code in two ways:
Work with a different corpus of your choosing, and incorporate at least one additional sentiment lexicon (possibly from another R package that you’ve found through research). As usual, please submit links to both an .Rmd file posted in your GitHub repository and to your code on rpubs.com. You may work on a small team on this assignment.
In this assignment I extend the primary code from Chapter 2 of Text Mining with R by applying the AFINN, bing, nrc, and loughran sentiment lexicons to Alice’s Adventures in Wonderland, downloaded with the gutenbergr package in R.
Using the get_sentiments() function from tidytext, I downloaded each sentiment lexicon with its corresponding measure; some lexicons required me to agree to a license before downloading. I downloaded the AFINN lexicon from Finn Årup Nielsen, with agreement: http://www2.imm.dtu.dk/pubdb/pubs/6010-full.html.
get_sentiments("afinn")
## # A tibble: 2,477 × 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ℹ 2,467 more rows
Downloaded bing from Bing Liu and collaborators: https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
get_sentiments("bing")
## # A tibble: 6,786 × 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # ℹ 6,776 more rows
Downloaded nrc from Saif Mohammad and Peter Turney, with agreement: http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm
get_sentiments("nrc")
## # A tibble: 13,872 × 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ℹ 13,862 more rows
The janeaustenr package in R contains Jane Austen’s six completed novels. To find the most common joy words in the book “Emma”, the group_by() and mutate() functions were first used to add a line number and a chapter number to each line, and the text was then unnested into a tidy one-word-per-row format.
For more information: https://github.com/juliasilge/janeaustenr
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text,
                                regex("^chapter [\\divxlc]",
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
head(tidy_books)
## # A tibble: 6 × 4
## book linenumber chapter word
## <fct> <int> <int> <chr>
## 1 Sense & Sensibility 1 0 sense
## 2 Sense & Sensibility 1 0 and
## 3 Sense & Sensibility 1 0 sensibility
## 4 Sense & Sensibility 3 0 by
## 5 Sense & Sensibility 3 0 jane
## 6 Sense & Sensibility 3 0 austen
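The chapter column above relies on cumsum() over a logical vector: every line that matches the chapter-heading pattern adds 1, so all following lines inherit the current chapter number. A minimal sketch of that idea, assuming a made-up five-line text (the `toy` tibble is hypothetical and is not part of janeaustenr):

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical five-line "book", used only for illustration
toy <- tibble(text = c("A TOY BOOK",
                       "CHAPTER I",
                       "It was a fine day.",
                       "Chapter 2",
                       "Nothing happened."))

toy_tidy <- toy %>%
  mutate(linenumber = row_number(),
         # TRUE on heading lines; cumsum() turns that into a running chapter id
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  # one row per word; linenumber and chapter are carried along
  unnest_tokens(word, text)

toy_tidy
```

Lines before the first heading get chapter 0, which matches the title lines in the Austen output above.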
Second, I extracted the joy words from the nrc lexicon with filter(), restricted the text to “Emma”, used inner_join() to keep only the words that appear in the joy list, and used count() to find how many times each word was used.
nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")
tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## Joining with `by = join_by(word)`
## # A tibble: 301 × 2
## word n
## <chr> <int>
## 1 good 359
## 2 friend 166
## 3 hope 143
## 4 happy 125
## 5 love 117
## 6 deal 92
## 7 found 92
## 8 present 89
## 9 kind 82
## 10 happiness 76
## # ℹ 291 more rows
head(nrc_joy)
## # A tibble: 6 × 2
## word sentiment
## <chr> <chr>
## 1 absolution joy
## 2 abundance joy
## 3 abundant joy
## 4 accolade joy
## 5 accompaniment joy
## 6 accomplish joy
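The join-then-count pattern can be checked by hand on a small example. This is only a sketch: the `tokens` and `toy_joy` tibbles are hypothetical stand-ins for the tidy text and the nrc joy list.

```r
library(dplyr)

# Hypothetical tokens and a hypothetical three-word "joy" lexicon
tokens  <- tibble(word = c("good", "friend", "the", "good",
                           "rain", "hope", "good"))
toy_joy <- tibble(word = c("good", "friend", "hope"),
                  sentiment = "joy")

joy_counts <- tokens %>%
  inner_join(toy_joy, by = "word") %>%  # drops tokens not in the lexicon
  count(word, sort = TRUE)              # most frequent joy words first

joy_counts
```

Words such as "the" and "rain" disappear because inner_join() keeps only rows with a match in both tables.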
Used bing to find the negative and positive words in each book by Austen and calculated the net sentiment (positive minus negative) for each 80-line section.
library(tidyr)
jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
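The warning above says that at least one word in the text matches more than one row of the lexicon, which means that word is listed more than once. A minimal sketch of how such words could be found and how the join could declare the relationship, assuming dplyr 1.1.0 or later and a hypothetical lexicon (`toy_lex`) in which one word carries two labels:

```r
library(dplyr)

# Hypothetical lexicon in which "envious" is listed under both labels
toy_lex <- tibble(word = c("happy", "envious", "envious"),
                  sentiment = c("positive", "positive", "negative"))
tokens  <- tibble(word = c("happy", "envious", "happy"))

# Words listed more than once are what trigger the warning
dup_words <- toy_lex %>%
  count(word) %>%
  filter(n > 1)

# Declaring the relationship returns the same rows without the warning
joined <- tokens %>%
  inner_join(toy_lex, by = "word", relationship = "many-to-many")

dup_words
joined
```

Each occurrence of a duplicated word is counted once per label, so it adds one positive and one negative and leaves the net sentiment unchanged.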
head(jane_austen_sentiment)
## # A tibble: 6 × 5
## book index negative positive sentiment
## <fct> <dbl> <int> <int> <int>
## 1 Sense & Sensibility 0 16 32 16
## 2 Sense & Sensibility 1 19 53 34
## 3 Sense & Sensibility 2 12 31 19
## 4 Sense & Sensibility 3 15 31 16
## 5 Sense & Sensibility 4 16 34 18
## 6 Sense & Sensibility 5 16 51 35
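The index column comes from integer division: linenumber %/% 80 assigns lines 0 to 79 to section 0, lines 80 to 159 to section 1, and so on. A small sketch of the counting and pivoting steps, assuming a hypothetical table of already-labelled tokens (`labelled`):

```r
library(dplyr)
library(tidyr)

# Hypothetical tokens that already carry a bing-style label
labelled <- tibble(
  linenumber = c(1, 5, 79, 80, 81, 150, 160, 161),
  sentiment  = c("positive", "negative", "positive", "negative",
                 "negative", "positive", "positive", "positive")
)

net <- labelled %>%
  count(index = linenumber %/% 80, sentiment) %>%   # words per section and label
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>%                  # missing label -> 0, not NA
  mutate(sentiment = positive - negative)           # net sentiment per section

net
```

Section 2 has no negative words, so values_fill = 0 is what keeps its net sentiment from becoming NA.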
Data visualization of the net sentiment for each book. Plotting the sentiment against the index on the x-axis shows how the sentiment changes over the trajectory of each story.
library(ggplot2)
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")
Compared the three sentiment dictionaries on “Pride & Prejudice”.
pride_prejudice <- tidy_books %>%
  filter(book == "Pride & Prejudice") # keep only the words from "Pride & Prejudice"
afinn <- pride_prejudice %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(index = linenumber %/% 80) %>%
  summarise(sentiment = sum(value)) %>%
  mutate(method = "AFINN")
## Joining with `by = join_by(word)`
bing_and_nrc <- bind_rows(
  pride_prejudice %>%
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  pride_prejudice %>%
    inner_join(get_sentiments("nrc") %>%
                 filter(sentiment %in% c("positive",
                                         "negative"))
    ) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("nrc") %>% filter(sentiment %in% : Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 215 of `x` matches multiple rows in `y`.
## ℹ Row 5178 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
Visual of the three compared sentiment lexicons. The three dictionaries give different results: AFINN gives the highest positive values and the most variance, Bing has the lowest positive values, and NRC has the fewest negative values.
bind_rows(afinn,
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")
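One reason the three panels sit on different scales is that the scores are built differently: the AFINN score is a sum of numeric word values, while the bing and nrc scores are a count of positive words minus a count of negative words. A minimal sketch of the two calculations on the same tokens, assuming hypothetical lexicon entries (the values in `toy_afinn` are illustrative, not looked up from the real lexicon):

```r
library(dplyr)

# Hypothetical scored (AFINN-style) and labelled (bing-style) lexicons
toy_afinn <- tibble(word  = c("superb", "good", "bad", "awful"),
                    value = c(5, 3, -3, -3))
toy_bing  <- tibble(word      = c("superb", "good", "bad", "awful"),
                    sentiment = c("positive", "positive",
                                  "negative", "negative"))
tokens <- tibble(word = c("superb", "good", "bad"))

# AFINN-style: add up the numeric values
afinn_score <- tokens %>%
  inner_join(toy_afinn, by = "word") %>%
  summarise(net = sum(value)) %>%
  pull(net)

# bing-style: positive count minus negative count
bing_score <- tokens %>%
  inner_join(toy_bing, by = "word") %>%
  summarise(net = sum(sentiment == "positive") -
                  sum(sentiment == "negative")) %>%
  pull(net)

c(afinn = afinn_score, bing = bing_score)
```

In this sketch a single strongly scored word moves the summed score more than it moves the count-based score, which is consistent with the larger swings in the AFINN panel.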
Comparing nrc to bing: both lexicons contain more negative than positive words, but bing has more negative words (4,781 vs. 3,316) and nrc has more positive words (2,308 vs. 2,005).
get_sentiments("nrc") %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  count(sentiment)
## # A tibble: 2 × 2
## sentiment n
## <chr> <int>
## 1 negative 3316
## 2 positive 2308
get_sentiments("bing") %>%
  count(sentiment)
## # A tibble: 2 × 2
## sentiment n
## <chr> <int>
## 1 negative 4781
## 2 positive 2005
Looked up the list of works by Lewis Carroll available through the gutenbergr package, which downloads and processes public domain works from the Project Gutenberg collection.
Read more: https://github.com/ropensci/gutenbergr
gutenberg_works(author == "Carroll, Lewis")
## # A tibble: 15 × 8
## gutenberg_id title author gutenberg_author_id language gutenberg_bookshelf
## <int> <chr> <chr> <int> <chr> <chr>
## 1 11 "Alice'… Carro… 7 en "Children's Litera…
## 2 13 "The Hu… Carro… 7 en "Children's Litera…
## 3 620 "Sylvie… Carro… 7 en ""
## 4 4763 "The Ga… Carro… 7 en "Philosophy"
## 5 19551 "Alice … Carro… 7 en ""
## 6 28696 "Symbol… Carro… 7 en "Philosophy"
## 7 28885 "Alice'… Carro… 7 en "Banned Books from…
## 8 29042 "A Tang… Carro… 7 en "Mathematics"
## 9 29888 "The Hu… Carro… 7 en ""
## 10 33582 "Rhyme?… Carro… 7 en ""
## 11 35497 "Three … Carro… 7 en ""
## 12 35535 "Feedin… Carro… 7 en ""
## 13 35688 "Alice … Carro… 7 en ""
## 14 36308 "Songs … Carro… 7 en ""
## 15 38065 "Eight … Carro… 7 en ""
## # ℹ 2 more variables: rights <chr>, has_text <lgl>
Downloaded Alice’s Adventures in Wonderland by Lewis Carroll (Project Gutenberg ID 28885) with the gutenbergr package in R.
Alice_in_wonderland_Adv <- gutenberg_download(28885)
## Determining mirror for Project Gutenberg from https://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
Tidy format for the words in the text to be analyzed:
Alice_in_wonderland_Adv_tidy <- Alice_in_wonderland_Adv %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  unnest_tokens(word, text)
Compared the three sentiment lexicons NRC, bing, and AFINN in a plot for Alice’s Adventures in Wonderland. NRC had the fewest negative values, AFINN had the highest positive values, and bing had the most negative values.
afinn2 <- Alice_in_wonderland_Adv_tidy %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(index = linenumber %/% 80) %>%
  summarise(sentiment = sum(value)) %>%
  mutate(method = "AFINN")
## Joining with `by = join_by(word)`
bing_and_nrc2 <- bind_rows(
  Alice_in_wonderland_Adv_tidy %>%
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  Alice_in_wonderland_Adv_tidy %>%
    inner_join(get_sentiments("nrc") %>%
                 filter(sentiment %in% c("positive",
                                         "negative"))
    ) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("nrc") %>% filter(sentiment %in% : Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 2229 of `x` matches multiple rows in `y`.
## ℹ Row 5004 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
bind_rows(afinn2,
          bing_and_nrc2) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")
Used loughran as the additional lexicon to look at positive and negative words. For loughran I had to agree to a license before downloading.
loughran_posneg <- get_sentiments("loughran") %>%
  filter(sentiment %in% c("positive", "negative"))
AIWA_loughran <- Alice_in_wonderland_Adv_tidy %>%
  inner_join(loughran_posneg) %>%
  count(index = linenumber %/% 80, sentiment) %>%
  # pivot_wider() is used here in place of the superseded spread()
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
Data visualization for loughran
# par(mfrow = ...) only affects base graphics, so it is not needed for ggplot2
ggplot(AIWA_loughran, aes(index, sentiment)) +
  geom_col(show.legend = FALSE)
The Loughran lexicon has far more negative words than positive words, which explains why the Loughran plot for Alice’s Adventures in Wonderland has so many negative sentiment values. The bing plot for the same book had more positive values than the Loughran plot.
get_sentiments("loughran") %>%
  filter(sentiment %in% c("positive",
                          "negative")) %>%
  count(sentiment)
## # A tibble: 2 × 2
## sentiment n
## <chr> <int>
## 1 negative 2355
## 2 positive 354
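The balance of each lexicon can be put on a common scale by computing the share of negative words among its positive and negative entries. This sketch uses the counts printed in the outputs above for nrc, bing, and loughran; the `lexicon_counts` table and the `share_negative` column are names introduced here for illustration.

```r
library(dplyr)

# Counts copied from the count(sentiment) outputs shown earlier
lexicon_counts <- tibble(
  lexicon  = c("nrc", "bing", "loughran"),
  negative = c(3316, 4781, 2355),
  positive = c(2308, 2005, 354)
)

lexicon_balance <- lexicon_counts %>%
  mutate(share_negative = negative / (negative + positive)) %>%
  arrange(share_negative)   # most balanced lexicon first

lexicon_balance
```

Rounded to two decimals, the negative share is about 0.59 for nrc, 0.70 for bing, and 0.87 for loughran.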
Using sentiment lexicons, I was able to analyze the words most frequently used in a text that are categorized as positive or negative. In my opinion, I would use nrc because its counts of positive and negative words are the closest to each other, which I feel would help avoid bias toward one label. For example, Loughran has a much higher count of negative words than positive words, so using Loughran will usually produce a more negative sentiment.
Base code: Silge, J. & Robinson, D. (2016). Welcome to Text Mining with R. O’Reilly Media.
Extended analysis:
citation('gutenbergr')
## To cite package 'gutenbergr' in publications use:
##
## Johnston M, Robinson D (2023). _gutenbergr: Download and Process
## Public Domain Works from Project Gutenberg_. R package version 0.2.4,
## <https://CRAN.R-project.org/package=gutenbergr>.
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {gutenbergr: Download and Process Public Domain Works from Project Gutenberg},
## author = {Myfanwy Johnston and David Robinson},
## year = {2023},
## note = {R package version 0.2.4},
## url = {https://CRAN.R-project.org/package=gutenbergr},
## }