In Text Mining with R, Chapter 2 looks at Sentiment Analysis. In this assignment, you should start by getting the primary example code from chapter 2 working in an R Markdown document. You should provide a citation to this base code. You’re then asked to extend the code in two ways:
Work with a different corpus of your choosing,
and Incorporate at least one additional sentiment lexicon (possibly from another R package that you’ve found through research).
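We examine three sentiment lexicons, loaded with get_sentiments() from ‘tidytext’ (AFINN and NRC are downloaded through the ‘textdata’ package). All three lexicons are based on unigrams, that is, single words.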
library(tidytext)
library(textdata)
get_sentiments("afinn")
## # A tibble: 2,477 × 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # … with 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 × 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # … with 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,872 × 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # … with 13,862 more rows
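The printouts above show that the lexicons encode sentiment differently: AFINN gives each word a numeric score, while Bing and NRC give category labels. A minimal sketch of what that means for scoring, using tiny hand-made lexicons (`toy_afinn`, `toy_bing`, and the token list are illustrative stand-ins, not the real lexicons):

```r
library(dplyr)

# Toy stand-ins for the real lexicons (words and values are illustrative only)
toy_afinn <- tibble(word = c("abandon", "abhor", "good"),
                    value = c(-2, -3, 3))
toy_bing  <- tibble(word = c("abandon", "abhor", "good"),
                    sentiment = c("negative", "negative", "positive"))

# Hypothetical tokenised sentence; "people" is in neither toy lexicon
tokens <- tibble(word = c("good", "people", "abhor", "abandon"))

# AFINN-style scoring: sum the numeric scores of the matched words
afinn_score <- tokens %>%
  inner_join(toy_afinn, by = "word") %>%
  summarise(score = sum(value)) %>%
  pull(score)

# Bing-style scoring: positive matches minus negative matches
bing_score <- tokens %>%
  inner_join(toy_bing, by = "word") %>%
  summarise(score = sum(sentiment == "positive") - sum(sentiment == "negative")) %>%
  pull(score)
```

Words that appear in neither lexicon simply drop out of the inner join, so they contribute nothing to either score.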
The ‘janeaustenr’ package contains the text of Jane Austen’s six completed novels: “Sense & Sensibility”, “Pride & Prejudice”, “Mansfield Park”, “Emma”, “Northanger Abbey”, and “Persuasion”.
library(janeaustenr)
library(dplyr)
library(stringr)
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    # Running count of chapter headings gives each line its chapter number
    chapter = cumsum(str_detect(text,
                                regex("^chapter [\\divxlc]",
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
knitr::kable(head(tidy_books, 50), caption = "This table contains the first 50 observations of the tidy_books data frame.")
book | linenumber | chapter | word |
---|---|---|---|
Sense & Sensibility | 1 | 0 | sense |
Sense & Sensibility | 1 | 0 | and |
Sense & Sensibility | 1 | 0 | sensibility |
Sense & Sensibility | 3 | 0 | by |
Sense & Sensibility | 3 | 0 | jane |
Sense & Sensibility | 3 | 0 | austen |
Sense & Sensibility | 5 | 0 | 1811 |
Sense & Sensibility | 10 | 1 | chapter |
Sense & Sensibility | 10 | 1 | 1 |
Sense & Sensibility | 13 | 1 | the |
Sense & Sensibility | 13 | 1 | family |
Sense & Sensibility | 13 | 1 | of |
Sense & Sensibility | 13 | 1 | dashwood |
Sense & Sensibility | 13 | 1 | had |
Sense & Sensibility | 13 | 1 | long |
Sense & Sensibility | 13 | 1 | been |
Sense & Sensibility | 13 | 1 | settled |
Sense & Sensibility | 13 | 1 | in |
Sense & Sensibility | 13 | 1 | sussex |
Sense & Sensibility | 13 | 1 | their |
Sense & Sensibility | 13 | 1 | estate |
Sense & Sensibility | 14 | 1 | was |
Sense & Sensibility | 14 | 1 | large |
Sense & Sensibility | 14 | 1 | and |
Sense & Sensibility | 14 | 1 | their |
Sense & Sensibility | 14 | 1 | residence |
Sense & Sensibility | 14 | 1 | was |
Sense & Sensibility | 14 | 1 | at |
Sense & Sensibility | 14 | 1 | norland |
Sense & Sensibility | 14 | 1 | park |
Sense & Sensibility | 14 | 1 | in |
Sense & Sensibility | 14 | 1 | the |
Sense & Sensibility | 14 | 1 | centre |
Sense & Sensibility | 14 | 1 | of |
Sense & Sensibility | 15 | 1 | their |
Sense & Sensibility | 15 | 1 | property |
Sense & Sensibility | 15 | 1 | where |
Sense & Sensibility | 15 | 1 | for |
Sense & Sensibility | 15 | 1 | many |
Sense & Sensibility | 15 | 1 | generations |
Sense & Sensibility | 15 | 1 | they |
Sense & Sensibility | 15 | 1 | had |
Sense & Sensibility | 15 | 1 | lived |
Sense & Sensibility | 15 | 1 | in |
Sense & Sensibility | 15 | 1 | so |
Sense & Sensibility | 16 | 1 | respectable |
Sense & Sensibility | 16 | 1 | a |
Sense & Sensibility | 16 | 1 | manner |
Sense & Sensibility | 16 | 1 | as |
Sense & Sensibility | 16 | 1 | to |
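The chapter column in the table comes from taking a cumulative sum over a logical vector: every line that looks like a chapter heading adds one to the counter, and all following lines inherit that value. A small sketch on made-up lines (`toy_text` is hypothetical, not taken from the novels):

```r
library(dplyr)
library(stringr)

# Hypothetical lines of text with two chapter headings
toy_text <- tibble(text = c("TITLE PAGE", "CHAPTER 1", "First line.",
                            "Second line.", "Chapter II", "Third line."))

toy_chapters <- toy_text %>%
  mutate(
    linenumber = row_number(),
    # TRUE counts as 1, so each heading bumps the running chapter number
    chapter = cumsum(str_detect(text,
                                regex("^chapter [\\divxlc]",
                                      ignore_case = TRUE)))
  )

toy_chapters$chapter
```

Lines before the first heading keep the value 0, which is why the title lines in the table above are assigned to chapter 0.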
Use the ‘filter’ function from the ‘dplyr’ package to keep only the joy words from the ‘nrc’ lexicon and only the text of “Emma”, then inner join the two data frames and count the joy words in the novel.
library(DT)
nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")
# Keep only Emma's tokens that appear in the joy word list, then count them
Emma <- tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy, by = "word") %>%
  count(word, sort = TRUE)
datatable(Emma)
Using the ‘bing’ lexicon, count the number of positive and negative words in each 80-line section of each book. Then calculate the net sentiment (positive - negative).
library(tidyr)
jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  # Integer division groups the lines into sections of 80 lines each
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
datatable(jane_austen_sentiment)
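The index column is produced by integer division: `linenumber %/% 80` maps lines 0 to 79 to section 0, lines 80 to 159 to section 1, and so on. A minimal sketch of the count, pivot, and subtract steps on made-up matches (`toy_matches` is hypothetical data, not output from the novels):

```r
library(dplyr)
library(tidyr)

# Hypothetical matched tokens: the line each came from and its Bing-style label
toy_matches <- tibble(
  linenumber = c(3, 40, 79, 80, 120, 159, 160),
  sentiment  = c("positive", "negative", "positive", "negative",
                 "negative", "positive", "negative")
)

toy_net <- toy_matches %>%
  # Bin lines into 80-line sections and count each label per section
  count(index = linenumber %/% 80, sentiment) %>%
  # One row per section, one column per label; absent labels become 0
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net = positive - negative)

toy_net
```

The `values_fill = 0` argument matters for sections that contain only one kind of word: without it the missing count would be `NA` and so would the net sentiment.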
The net sentiment is plotted against narrative time (the index on the x-axis), so we can see how net sentiment changes over the course of each novel.
library(ggplot2)
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")
Compare how the sentiment changes in Jane Austen’s “Pride & Prejudice” using three sentiment lexicons. The plot shows similar overall trends in sentiment across the book.
pride_prejudice <- tidy_books %>%
  filter(book == "Pride & Prejudice")
pride_prejudice
## # A tibble: 122,204 × 4
## book linenumber chapter word
## <fct> <int> <int> <chr>
## 1 Pride & Prejudice 1 0 pride
## 2 Pride & Prejudice 1 0 and
## 3 Pride & Prejudice 1 0 prejudice
## 4 Pride & Prejudice 3 0 by
## 5 Pride & Prejudice 3 0 jane
## 6 Pride & Prejudice 3 0 austen
## 7 Pride & Prejudice 7 1 chapter
## 8 Pride & Prejudice 7 1 1
## 9 Pride & Prejudice 10 1 it
## 10 Pride & Prejudice 10 1 is
## # … with 122,194 more rows
# AFINN: sum the numeric word scores within each 80-line section
afinn <- pride_prejudice %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(index = linenumber %/% 80) %>%
  summarise(sentiment = sum(value)) %>%
  mutate(method = "AFINN")
# Bing and NRC: count positive and negative words, then take the difference
bing_and_nrc <- bind_rows(
  pride_prejudice %>%
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  pride_prejudice %>%
    inner_join(get_sentiments("nrc") %>%
                 filter(sentiment %in% c("positive",
                                         "negative"))
    ) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>%
  mutate(sentiment = positive - negative)
bind_rows(afinn,
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")
Extend the code in two ways:
Work with a different corpus of your choosing,
and Incorporate at least one additional sentiment lexicon (possibly from another R package that you’ve found through research).
According to the Project Gutenberg website, the most popular book is “Romeo and Juliet”, so we run our sentiment analysis on it.
We need the book’s ID number to download its text into R. The ‘gutenberg_works’ function returns a table of Project Gutenberg metadata. Use the ‘filter’ function to keep only the rows for “Romeo and Juliet”, then use the ‘gutenberg_download()’ function to download the text.
library(gutenbergr)
library(DT)
gutenberg_works() %>%
filter(title == "Romeo and Juliet")
## # A tibble: 1 × 8
## gutenberg_id title author guten…¹ langu…² guten…³ rights has_t…⁴
## <int> <chr> <chr> <int> <chr> <chr> <chr> <lgl>
## 1 1513 Romeo and Juliet Shakespe… 65 en <NA> Publi… TRUE
## # … with abbreviated variable names ¹gutenberg_author_id, ²language,
## # ³gutenberg_bookshelf, ⁴has_text
romeo_and_juliet <- gutenberg_download(1513)
knitr::kable(head(romeo_and_juliet, 50), caption = "This table contains the first 50 lines of 'Romeo and Juliet'.")
gutenberg_id | text |
---|---|
1513 | THE TRAGEDY OF ROMEO AND JULIET |
1513 | |
1513 | |
1513 | |
1513 | by William Shakespeare |
1513 | |
1513 | |
1513 | Contents |
1513 | |
1513 | THE PROLOGUE. |
1513 | |
1513 | ACT I |
1513 | Scene I. A public place. |
1513 | Scene II. A Street. |
1513 | Scene III. Room in Capulet’s House. |
1513 | Scene IV. A Street. |
1513 | Scene V. A Hall in Capulet’s House. |
1513 | |
1513 | |
1513 | ACT II |
1513 | CHORUS. |
1513 | Scene I. An open place adjoining Capulet’s Garden. |
1513 | Scene II. Capulet’s Garden. |
1513 | Scene III. Friar Lawrence’s Cell. |
1513 | Scene IV. A Street. |
1513 | Scene V. Capulet’s Garden. |
1513 | Scene VI. Friar Lawrence’s Cell. |
1513 | |
1513 | |
1513 | ACT III |
1513 | Scene I. A public Place. |
1513 | Scene II. A Room in Capulet’s House. |
1513 | Scene III. Friar Lawrence’s cell. |
1513 | Scene IV. A Room in Capulet’s House. |
1513 | Scene V. An open Gallery to Juliet’s Chamber, overlooking the Garden. |
1513 | |
1513 | |
1513 | ACT IV |
1513 | Scene I. Friar Lawrence’s Cell. |
1513 | Scene II. Hall in Capulet’s House. |
1513 | Scene III. Juliet’s Chamber. |
1513 | Scene IV. Hall in Capulet’s House. |
1513 | Scene V. Juliet’s Chamber; Juliet on the bed. |
1513 | |
1513 | |
1513 | ACT V |
1513 | Scene I. Mantua. A Street. |
1513 | Scene II. Friar Lawrence’s Cell. |
1513 | Scene III. A churchyard; in it a Monument belonging to the Capulets. |
1513 |
To run our sentiment analysis, we need the text in one-token-per-row format, which we create with the ‘unnest_tokens’ function from the ‘tidytext’ package.
romeo_and_juliet <- romeo_and_juliet[c("text")] %>%
  mutate(
    linenumber = row_number(),
    # The play is divided into acts, not chapters, so this counter is
    # expected to stay at 0; it is kept to mirror the Austen code
    chapter = cumsum(str_detect(text,
                                regex("^chapter [\\divxlc]",
                                      ignore_case = TRUE)))) %>%
  unnest_tokens(word, text)
datatable(head(romeo_and_juliet,100))
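Because the play is organised into acts and scenes rather than chapters, a chapter pattern is unlikely to match any line. A hedged sketch of how an act counter could be built instead, assuming act headings sit on a line of their own in the form `ACT` plus a Roman numeral (`toy_play` is a made-up excerpt; in the real download the table of contents also lists the act headings, so those rows would need to be dropped first):

```r
library(dplyr)
library(stringr)

# Hypothetical excerpt laid out like the body of the play
toy_play <- tibble(text = c("THE PROLOGUE", "ACT I", "SCENE I. A public place.",
                            "Enter Sampson and Gregory.", "ACT II",
                            "SCENE I. An open place.", "Enter Romeo."))

toy_play <- toy_play %>%
  mutate(
    linenumber = row_number(),
    # Start a new act whenever a line is exactly "ACT" plus a Roman numeral
    act = cumsum(str_detect(text, "^ACT [IVX]+$"))
  )

toy_play$act
```

The anchors `^` and `$` keep the pattern from matching dialogue that merely begins with the word "act".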
The top words associated with negative sentiment are related to “death”. The top word associated with positive sentiment is “love”.
library(wordcloud)
library(reshape2)
word_count <- romeo_and_juliet %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE)
# Bar chart of the ten most frequent words for each sentiment
word_count %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(
    y = "Contribution to sentiment",
    x = NULL
  ) +
  coord_flip()
# Word cloud comparing negative (red) and positive (green) words
word_count %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("red", "green"),
                   max.words = 100)
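When `top_n()` is called without a weighting column it falls back to the last column of the data frame (here `n`) and prints a message saying so. A small sketch of the same per-group selection with `slice_max()` from ‘dplyr’ (version 1.0.0 or later), which names the ordering column explicitly; the counts in `toy_counts` are invented for illustration:

```r
library(dplyr)

# Hypothetical word counts per sentiment
toy_counts <- tibble(
  word      = c("death", "dead", "poor", "love", "good", "well"),
  sentiment = c("negative", "negative", "negative",
                "positive", "positive", "positive"),
  n         = c(10, 7, 3, 12, 8, 2)
)

top_words <- toy_counts %>%
  group_by(sentiment) %>%
  # Keep the 2 rows with the largest n in each sentiment group
  slice_max(n, n = 2) %>%
  ungroup()

top_words
```

Like `top_n()`, `slice_max()` keeps ties by default, so a group can return more rows than requested when counts are equal.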
“Death” appears in the anger, anticipation, disgust, fear, negative, sadness, and surprise categories of the NRC lexicon.
romeo_and_juliet_nrc <- romeo_and_juliet %>%
  inner_join(get_sentiments("nrc")) %>%
  count(word, sentiment)
romeo_and_juliet_nrc
## # A tibble: 2,238 × 3
## word sentiment n
## <chr> <chr> <int>
## 1 abuse anger 1
## 2 abuse disgust 1
## 3 abuse fear 1
## 4 abuse negative 1
## 5 abuse sadness 1
## 6 accident fear 1
## 7 accident negative 1
## 8 accident sadness 1
## 9 accident surprise 1
## 10 account trust 2
## # … with 2,228 more rows
# Ten most frequent words in each NRC category
romeo_and_juliet_nrc %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(
    y = "Contribution to sentiment",
    x = NULL
  ) +
  coord_flip()
There is one more lexicon available through the ‘textdata’ package that we have not yet used: “loughran”, a lexicon mainly used with financial statements.
Compare how the sentiment changes in “Romeo and Juliet” using four sentiment lexicons. The plot shows similar overall trends in sentiment across the play, with negative sentiment at the end. This is not surprising, as “Romeo and Juliet” has a tragic ending.
# AFINN: sum the numeric word scores within each 80-line section
afinn1 <- romeo_and_juliet %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(index = linenumber %/% 80) %>%
  summarise(sentiment = sum(value)) %>%
  mutate(method = "AFINN")
# Bing, NRC, and Loughran: count positive and negative words per section
bing_and_nrc1 <- bind_rows(
  romeo_and_juliet %>%
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  romeo_and_juliet %>%
    inner_join(get_sentiments("nrc") %>%
                 filter(sentiment %in% c("positive",
                                         "negative"))
    ) %>%
    mutate(method = "NRC"),
  romeo_and_juliet %>%
    inner_join(get_sentiments("loughran") %>%
                 filter(sentiment %in% c("positive",
                                         "negative"))
    ) %>%
    mutate(method = "loughran")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>%
  mutate(sentiment = positive - negative)
bind_rows(afinn1,
          bing_and_nrc1) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")
According to Project Gutenberg, the top 5 ebooks are: “Romeo and Juliet” by William Shakespeare, “A Room with a View” by E. M. Forster, “Middlemarch” by George Eliot, “Moby Dick; Or, The Whale” by Herman Melville, and “Little Women; Or, Meg, Jo, Beth, and Amy” by Louisa May Alcott.
library(gutenbergr)
library(DT)
gutenberg_works() %>%
filter(title %in% c("Romeo and Juliet","A Room with a View","Middlemarch","Moby Dick; Or, The Whale", "Little Women; Or, Meg, Jo, Beth, and Amy"))
## # A tibble: 5 × 8
## gutenberg_id title author guten…¹ langu…² guten…³ rights has_t…⁴
## <int> <chr> <chr> <int> <chr> <chr> <chr> <lgl>
## 1 145 Middlemarch Eliot… 90 en Best B… Publi… TRUE
## 2 1513 Romeo and Juliet Shake… 65 en <NA> Publi… TRUE
## 3 2489 Moby Dick; Or, The… Melvi… 9 en Best B… Publi… TRUE
## 4 2641 A Room with a View Forst… 975 en Italy Publi… TRUE
## 5 37106 Little Women; Or, … Alcot… 102 en <NA> Publi… TRUE
## # … with abbreviated variable names ¹gutenberg_author_id, ²language,
## # ³gutenberg_bookshelf, ⁴has_text
top_5_books <- gutenberg_download(c(1513, 2641, 145, 2489, 37106))
To run our sentiment analysis, we again need the text in one-token-per-row format, which we create with the ‘unnest_tokens’ function from the ‘tidytext’ package.
top_5_books <- top_5_books %>%
  # Number lines and chapters separately within each book
  group_by(gutenberg_id) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text,
                                regex("^chapter [\\divxlc]",
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
top_5_books$gutenberg_id[grep("145", top_5_books$gutenberg_id)] <- "Middlemarch"
top_5_books$gutenberg_id[grep("1513", top_5_books$gutenberg_id)] <- "Romeo and Juliet"
top_5_books$gutenberg_id[grep("2489", top_5_books$gutenberg_id)] <- "Moby Dick; Or, The Whale"
top_5_books$gutenberg_id[grep("2641", top_5_books$gutenberg_id)] <- "A Room with a View"
top_5_books$gutenberg_id[grep("37106", top_5_books$gutenberg_id)] <- "Little Women"
colnames(top_5_books)[1] <- "book"
datatable(head(top_5_books,100))
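Relabelling IDs with `grep()` works for these five books, but `grep()` matches substrings, so a pattern such as "145" would also match an ID like 1450 if one were present. A sketch of an alternative that matches IDs exactly by joining a small lookup table (`book_titles` uses the IDs from the `gutenberg_works()` output above; `toy_tokens` is a hypothetical stand-in for the tokenised text):

```r
library(dplyr)

# Lookup table: one row per Project Gutenberg ID and its display title
book_titles <- tibble(
  gutenberg_id = c(145L, 1513L, 2489L, 2641L, 37106L),
  book = c("Middlemarch", "Romeo and Juliet", "Moby Dick; Or, The Whale",
           "A Room with a View", "Little Women")
)

# Hypothetical tokens standing in for the downloaded books
toy_tokens <- tibble(gutenberg_id = c(1513L, 145L, 37106L),
                     word = c("love", "town", "sisters"))

labelled <- toy_tokens %>%
  # Exact match on the ID column; unmatched IDs would get NA as their title
  left_join(book_titles, by = "gutenberg_id") %>%
  select(book, word)

labelled
```

A lookup table also keeps the ID-to-title mapping in one place, which makes it easier to extend when more books are added.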
Out of all 5 books, “Little Women” has the highest percentage of positive words. “Romeo and Juliet” has the highest percentage of negative words.
# Count positive and negative Bing words in each book
top_5_books_bing <- top_5_books %>%
  inner_join(get_sentiments("bing")) %>%
  group_by(book) %>%
  count(sentiment, sort = TRUE) %>%
  ungroup()
# Share of each sentiment among the matched words of each book
top_5_books_bing <- top_5_books_bing %>%
  group_by(book) %>%
  mutate(percentage = (n/sum(n)))
library(scales)
top_5 <- top_5_books_bing
top_5$percentage <- percent(top_5$percentage, accuracy = 1)
datatable(top_5)
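Note that the percentage computed here is a share of the words matched by the lexicon within each book, not a share of all words in the book. A minimal sketch of the grouped calculation on invented counts (`toy_bing_counts` is hypothetical):

```r
library(dplyr)

# Hypothetical positive and negative word counts for two books
toy_bing_counts <- tibble(
  book      = c("A", "A", "B", "B"),
  sentiment = c("negative", "positive", "negative", "positive"),
  n         = c(30, 70, 60, 40)
)

toy_share <- toy_bing_counts %>%
  group_by(book) %>%
  # sum(n) is evaluated per book, so the shares of each book add up to 1
  mutate(percentage = n / sum(n)) %>%
  ungroup()

toy_share
```

Because the shares within a book sum to 1, the book with the highest positive share is necessarily the one with the lowest negative share when only two categories are counted.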
Only two books have a higher percentage of positive words than negative words: “Middlemarch” and “Little Women”. “Little Women” has the highest percentage of positive words and the lowest percentage of negative words.
ggplot(top_5_books_bing, aes(x = book, y = percentage, fill = sentiment)) +
  geom_bar(stat = 'identity', position = 'dodge') +
  coord_flip() +
  scale_y_continuous(labels = scales::percent)
# Net sentiment per 80-line section of each book
top_5_books_sentiment <- top_5_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
datatable(head(top_5_books_sentiment,100))
“A Room with a View” starts and ends with negative sentiment.
“Little Women” contains mostly positive sentiment throughout.
“Romeo and Juliet” starts and ends with negative sentiment. This is not surprising, as Romeo and Juliet face many obstacles and meet a tragic end.
“Middlemarch” starts and ends with positive sentiment.
“Moby Dick” ends with negative sentiment.
ggplot(top_5_books_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")
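Statements about how a book starts and ends can be checked numerically by pulling the net sentiment of the first and last section of each book. A sketch on invented values (`toy_sentiment` is hypothetical and only mimics the shape of the per-section table, with columns book, index, and sentiment):

```r
library(dplyr)

# Hypothetical net sentiment per section for two books
toy_sentiment <- tibble(
  book      = c("A", "A", "A", "B", "B", "B"),
  index     = c(0, 1, 2, 0, 1, 2),
  sentiment = c(-4, 6, -9, 5, -1, 3)
)

start_end <- toy_sentiment %>%
  group_by(book) %>%
  summarise(
    start = sentiment[which.min(index)],  # net sentiment of the first section
    end   = sentiment[which.max(index)]   # net sentiment of the last section
  )

start_end
```

A single 80-line section is a small sample, so averaging the first and last few sections may give a steadier picture than one section alone.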
The “loughran” lexicon from the ‘textdata’ package is mainly used with financial statements, so we also apply it to financial text.
On Kaggle, I found a dataset that contains financial news headlines.
To run our sentiment analysis, we need the text in one-token-per-row format, which we create with the ‘unnest_tokens’ function from the ‘tidytext’ package.
library(tidyr)
library(dplyr)
library(stringr)
raw_financial <- read.delim(file = "https://raw.githubusercontent.com/suswong/Data-607-Assignments/main/all-data.csv", header = FALSE, sep = ",")
datatable(raw_financial)
colnames(raw_financial) <- c("sentiment", "text")
# Drop the dataset's own sentiment label and keep only the headline text
tidy_financial <- raw_financial[-1] %>%
  mutate(linenumber = row_number()) %>%
  unnest_tokens(word, text)
get_sentiments("loughran")
## # A tibble: 4,150 × 2
## word sentiment
## <chr> <chr>
## 1 abandon negative
## 2 abandoned negative
## 3 abandoning negative
## 4 abandonment negative
## 5 abandonments negative
## 6 abandons negative
## 7 abdicated negative
## 8 abdicates negative
## 9 abdicating negative
## 10 abdication negative
## # … with 4,140 more rows
# Loughran category counts per word and headline
financial_sentiment <- tidy_financial %>%
  inner_join(get_sentiments("loughran")) %>%
  count(word, index = linenumber, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0)
# Total number of matched words in each Loughran category
g <- tidy_financial %>%
  inner_join(get_sentiments("loughran")) %>%
  count(sentiment)
ggplot(g, aes(x = reorder(sentiment, n), y = n)) +
  geom_bar(stat = "identity") +
  coord_flip()
word_count_financial <- tidy_financial %>%
  inner_join(get_sentiments("loughran")) %>%
  count(word, sentiment, sort = TRUE)
datatable(word_count_financial)
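# Keep only the positive and negative categories so that the two colours
# map to negative (red) and positive (green)
word_count_financial %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("red", "green"),
                   max.words = 100)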
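Since each headline is its own short document, the line number can serve directly as the grouping index for a per-headline score. A sketch of that idea with a made-up lexicon that has more than two categories, as the Loughran lexicon does (`toy_tokens` and `toy_lexicon` are hypothetical; the word-to-category assignments are illustrative, not taken from the real lexicon):

```r
library(dplyr)
library(tidyr)

# Hypothetical headline tokens: linenumber identifies the headline
toy_tokens <- tibble(
  linenumber = c(1, 1, 1, 2, 2, 3),
  word = c("profit", "gain", "loss", "lawsuit", "loss", "uncertain")
)

# Toy multi-category lexicon (assignments are illustrative only)
toy_lexicon <- tibble(
  word = c("profit", "gain", "loss", "lawsuit", "uncertain"),
  sentiment = c("positive", "positive", "negative", "litigious", "uncertainty")
)

headline_scores <- toy_tokens %>%
  inner_join(toy_lexicon, by = "word") %>%
  # Count each category per headline, then spread categories into columns
  count(linenumber, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net = positive - negative)

headline_scores
```

Categories other than positive and negative stay available as separate columns, so a headline with only litigious or uncertainty words gets a net score of 0 rather than being dropped.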
Romeo and Juliet
The plot of net sentiment shows that sentiment rises and falls over the course of the play, but it ultimately ends with negative sentiment. This is not surprising, as the star-crossed lovers meet a tragic end.
Top 5 Ebooks on Gutenberg
According to Project Gutenberg, the top 5 ebooks are:
“Romeo and Juliet” by William Shakespeare
“A Room with a View” by E. M. Forster
“Middlemarch” by George Eliot
“Moby Dick; Or, The Whale” by Herman Melville
“Little Women; Or, Meg, Jo, Beth, and Amy” by Louisa May Alcott.
Out of all 5 books, “Little Women” has the highest percentage of positive words and “Romeo and Juliet” has the highest percentage of negative words. Both “A Room with a View” and “Romeo and Juliet” start and end with negative sentiment. “Middlemarch” starts and ends with positive sentiment. “Moby Dick” ends with negative sentiment.
Silge, J., & Robinson, D. (2017). Text mining with R: A tidy approach. O’Reilly.
Loughran-McDonald sentiment lexicon — lexicon_loughran. (n.d.). https://emilhvitfeldt.github.io/textdata/reference/lexicon_loughran.html
Sentiment Analysis for Financial News. (2020, May 27). Kaggle. https://www.kaggle.com/datasets/ankurzing/sentiment-analysis-for-financial-news
Project Gutenberg. (n.d.). Project Gutenberg. https://www.gutenberg.org/