Assignment Prompt

In Text Mining with R, Chapter 2 looks at Sentiment Analysis. In this assignment, you should start by getting the primary example code from chapter 2 working in an R Markdown document. You should provide a citation to this base code. You’re then asked to extend the code in two ways:

Work with a different corpus of your choosing,
and Incorporate at least one additional sentiment lexicon (possibly from another R package that you’ve found through research).

Text Mining with R

In Text Mining with R, Chapter 2 looks at Sentiment Analysis.

Load Lexicon

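We look at three general-purpose sentiment lexicons, loaded with ‘get_sentiments()’ from the ‘tidytext’ package (the ‘textdata’ package supplies the AFINN and NRC downloads). All three lexicons are based on unigrams, that is, single words.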

library(tidytext)
library(textdata)

get_sentiments("afinn")

## # A tibble: 2,477 × 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # … with 2,467 more rows

get_sentiments("bing")

## # A tibble: 6,786 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # … with 6,776 more rows

get_sentiments("nrc")

## # A tibble: 13,872 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # … with 13,862 more rows

Sentiment Analysis of 6 Jane Austen Books

Load corpus in R

The ‘janeaustenr’ package contains the text of Jane Austen’s six completed novels: “Sense & Sensibility”, “Pride & Prejudice”, “Mansfield Park”, “Emma”, “Northanger Abbey”, and “Persuasion”.

library(janeaustenr)
library(dplyr)
library(stringr)

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, 
                                regex("^chapter [\\divxlc]", 
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

knitr::kable(head(tidy_books, 50), caption = "This table contains the first 50 observations of the tidy_books data frame.")

This table contains the first 50 observations of the tidy_books data frame.
book	linenumber	chapter	word
Sense & Sensibility	1	0	sense
Sense & Sensibility	1	0	and
Sense & Sensibility	1	0	sensibility
Sense & Sensibility	3	0	by
Sense & Sensibility	3	0	jane
Sense & Sensibility	3	0	austen
Sense & Sensibility	5	0	1811
Sense & Sensibility	10	1	chapter
Sense & Sensibility	10	1	1
Sense & Sensibility	13	1	the
Sense & Sensibility	13	1	family
Sense & Sensibility	13	1	of
Sense & Sensibility	13	1	dashwood
Sense & Sensibility	13	1	had
Sense & Sensibility	13	1	long
Sense & Sensibility	13	1	been
Sense & Sensibility	13	1	settled
Sense & Sensibility	13	1	in
Sense & Sensibility	13	1	sussex
Sense & Sensibility	13	1	their
Sense & Sensibility	13	1	estate
Sense & Sensibility	14	1	was
Sense & Sensibility	14	1	large
Sense & Sensibility	14	1	and
Sense & Sensibility	14	1	their
Sense & Sensibility	14	1	residence
Sense & Sensibility	14	1	was
Sense & Sensibility	14	1	at
Sense & Sensibility	14	1	norland
Sense & Sensibility	14	1	park
Sense & Sensibility	14	1	in
Sense & Sensibility	14	1	the
Sense & Sensibility	14	1	centre
Sense & Sensibility	14	1	of
Sense & Sensibility	15	1	their
Sense & Sensibility	15	1	property
Sense & Sensibility	15	1	where
Sense & Sensibility	15	1	for
Sense & Sensibility	15	1	many
Sense & Sensibility	15	1	generations
Sense & Sensibility	15	1	they
Sense & Sensibility	15	1	had
Sense & Sensibility	15	1	lived
Sense & Sensibility	15	1	in
Sense & Sensibility	15	1	so
Sense & Sensibility	16	1	respectable
Sense & Sensibility	16	1	a
Sense & Sensibility	16	1	manner
Sense & Sensibility	16	1	as
Sense & Sensibility	16	1	to
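
The chapter column in the table above comes from a running sum of regex matches: every line that looks like a chapter heading increments the counter, and the lines that follow inherit that number. Here is a minimal base-R sketch of the idea. The `lines` vector is made up for illustration, and `grepl(..., perl = TRUE)` stands in for `str_detect()`.

```r
# Illustrative lines only; the real input is austen_books()$text.
lines <- c("SENSE AND SENSIBILITY", "", "CHAPTER 1",
           "The family of Dashwood had long been settled in Sussex.",
           "CHAPTER 2", "Mrs. John Dashwood now installed herself.")

# TRUE for lines that start with "chapter" followed by a digit or Roman numeral.
is_heading <- grepl("^chapter [\\divxlc]", lines, ignore.case = TRUE, perl = TRUE)

# The running total of headings seen so far is the chapter number of each line.
chapter <- cumsum(is_heading)
print(data.frame(chapter, lines))
```

Lines before the first heading (the title page) get chapter 0, which matches the first rows of the table.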

Filter for Joy Words from “Emma”

Use the ‘filter’ function from the ‘dplyr’ package to keep only the joy words from the ‘nrc’ lexicon and only the text of “Emma”, then inner join the two data frames and count how often each joy word occurs.

library(DT)
nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

Emma <- tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)

datatable(Emma)

Number of Positive and Negative Sentiment for Each Book

Using the ‘bing’ lexicon, count the number of positive and negative words in each 80-line section of each book. Then calculate the net sentiment (positive - negative) for each section.

library(tidyr)

jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)

datatable(jane_austen_sentiment)
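
The `index = linenumber %/% 80` step is what turns individual lines into sections of narrative time. The sketch below repeats the same arithmetic in base R on a handful of hypothetical lexicon hits. The `hits` data frame is invented for illustration, and `table()` stands in for the `count()` and `pivot_wider()` steps.

```r
# Hypothetical lexicon hits: the line each matched word came from and its polarity.
hits <- data.frame(
  linenumber = c(3, 40, 79, 80, 95, 160, 161),
  sentiment  = c("positive", "negative", "positive", "negative",
                 "negative", "positive", "positive")
)

# Integer division puts lines 0-79 in section 0, 80-159 in section 1, and so on.
hits$index <- hits$linenumber %/% 80

# Cross-tabulate section by polarity, then subtract to get net sentiment per section.
counts <- table(hits$index, hits$sentiment)
net <- counts[, "positive"] - counts[, "negative"]
print(net)
```

A section with no positive (or no negative) hits still gets a count of 0 here, which is the role `values_fill = 0` plays in the pipeline above.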

Data Visualization of Sentiment Changes Over the Plot Trajectory

The net sentiment is plotted against narrative time (the index on the x-axis), so we can see how the net sentiment changes over the plot trajectory.

library(ggplot2)

ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")

Comparing the Three Sentiment Dictionaries

Compare how the sentiment changes across Jane Austen’s “Pride & Prejudice” using three sentiment dictionaries. The plot shows similar overall trends in sentiment across the book for all three.

pride_prejudice <- tidy_books %>% 
  filter(book == "Pride & Prejudice")

pride_prejudice

## # A tibble: 122,204 × 4
##    book              linenumber chapter word     
##    <fct>                  <int>   <int> <chr>    
##  1 Pride & Prejudice          1       0 pride    
##  2 Pride & Prejudice          1       0 and      
##  3 Pride & Prejudice          1       0 prejudice
##  4 Pride & Prejudice          3       0 by       
##  5 Pride & Prejudice          3       0 jane     
##  6 Pride & Prejudice          3       0 austen   
##  7 Pride & Prejudice          7       1 chapter  
##  8 Pride & Prejudice          7       1 1        
##  9 Pride & Prejudice         10       1 it       
## 10 Pride & Prejudice         10       1 is       
## # … with 122,194 more rows

afinn <- pride_prejudice %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = linenumber %/% 80) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(method = "AFINN")

bing_and_nrc <- bind_rows(
  pride_prejudice %>% 
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  pride_prejudice %>% 
    inner_join(get_sentiments("nrc") %>% 
                 filter(sentiment %in% c("positive", 
                                         "negative"))
    ) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>% 
  mutate(sentiment = positive - negative)

bind_rows(afinn, 
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")
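
The three panels use `scales = "free_y"` because the methods do not produce comparable magnitudes: AFINN sums numeric scores, while the Bing and NRC lines count positive words minus negative words. The toy sketch below shows the difference on one short token list. The two mini lexicons contain only entries that appear in the lexicon printouts earlier in this document, and the token list is invented for illustration.

```r
# Mini lexicons built from rows shown in the lexicon printouts above.
afinn_toy <- c(abandon = -2, abandoned = -2)                  # numeric scores
nrc_toy   <- c(abandon = "negative", abandoned = "negative")  # polarity labels

# Hypothetical tokens from one section of text.
tokens <- c("the", "abandoned", "house", "abandon", "abandon")

# AFINN-style score: sum the values of the matched words (unmatched words are NA).
afinn_score <- sum(afinn_toy[tokens], na.rm = TRUE)

# Count-based score: number of positive matches minus number of negative matches.
labels <- nrc_toy[tokens]
count_score <- sum(labels == "positive", na.rm = TRUE) -
  sum(labels == "negative", na.rm = TRUE)

print(c(afinn = afinn_score, count_based = count_score))
```

The same three matched words give -6 under the summed scores and -3 under the counts, so the shapes of the curves are comparable across panels but the y-axis values are not.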

Extended Practice

Extend the code in two ways:

Work with a different corpus of your choosing,
and Incorporate at least one additional sentiment lexicon (possibly from another R package that you’ve found through research).

Romeo and Juliet

Download the text from Project Gutenberg

According to the Project Gutenberg website, the most popular book is “Romeo and Juliet”, so we run a sentiment analysis on it.

We need the book’s ID number to download its text into R. The ‘gutenberg_works()’ function returns a table of Project Gutenberg metadata. Use the ‘filter’ function to keep only the row for “Romeo and Juliet”, then use the ‘gutenberg_download()’ function to download the text.

library(gutenbergr)
library(DT)
gutenberg_works() %>%
  filter(title == "Romeo and Juliet")

## # A tibble: 1 × 8
##   gutenberg_id title            author    guten…¹ langu…² guten…³ rights has_t…⁴
##          <int> <chr>            <chr>       <int> <chr>   <chr>   <chr>  <lgl>  
## 1         1513 Romeo and Juliet Shakespe…      65 en      <NA>    Publi… TRUE   
## # … with abbreviated variable names ¹gutenberg_author_id, ²language,
## #   ³gutenberg_bookshelf, ⁴has_text

romeo_and_juliet <- gutenberg_download(1513)

knitr::kable(head(romeo_and_juliet, 50), caption = "This table contains the first 50 lines of 'Romeo and Juliet'.")

This table contains the first 50 lines of ‘Romeo and Juliet’.
gutenberg_id	text
1513	THE TRAGEDY OF ROMEO AND JULIET
1513
1513
1513
1513	by William Shakespeare
1513
1513
1513	Contents
1513
1513	THE PROLOGUE.
1513
1513	ACT I
1513	Scene I. A public place.
1513	Scene II. A Street.
1513	Scene III. Room in Capulet’s House.
1513	Scene IV. A Street.
1513	Scene V. A Hall in Capulet’s House.
1513
1513
1513	ACT II
1513	CHORUS.
1513	Scene I. An open place adjoining Capulet’s Garden.
1513	Scene II. Capulet’s Garden.
1513	Scene III. Friar Lawrence’s Cell.
1513	Scene IV. A Street.
1513	Scene V. Capulet’s Garden.
1513	Scene VI. Friar Lawrence’s Cell.
1513
1513
1513	ACT III
1513	Scene I. A public Place.
1513	Scene II. A Room in Capulet’s House.
1513	Scene III. Friar Lawrence’s cell.
1513	Scene IV. A Room in Capulet’s House.
1513	Scene V. An open Gallery to Juliet’s Chamber, overlooking the Garden.
1513
1513
1513	ACT IV
1513	Scene I. Friar Lawrence’s Cell.
1513	Scene II. Hall in Capulet’s House.
1513	Scene III. Juliet’s Chamber.
1513	Scene IV. Hall in Capulet’s House.
1513	Scene V. Juliet’s Chamber; Juliet on the bed.
1513
1513
1513	ACT V
1513	Scene I. Mantua. A Street.
1513	Scene II. Friar Lawrence’s Cell.
1513	Scene III. A churchyard; in it a Monument belonging to the Capulets.
1513

Tidy text

To run our sentiment analysis, we need the text in one-token-per-row format, which we get with the ‘unnest_tokens’ function from the ‘tidytext’ package.

romeo_and_juliet <- romeo_and_juliet[c("text")] %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, 
                                regex("^chapter [\\divxlc]", 
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

datatable(head(romeo_and_juliet,100))
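
The pipeline above reuses the chapter pattern from the Austen example, but a play is divided into acts and scenes, so that pattern never matches and the chapter column stays at 0. The sketch below shows this on a few lines taken from the contents table above, together with one possible act counter. The `^ACT [IVX]+$` pattern is an assumption for illustration and is not part of the original analysis.

```r
# A few lines copied from the table of contents shown earlier.
lines <- c("THE PROLOGUE.", "ACT I", "Scene I. A public place.",
           "Scene II. A Street.", "ACT II", "CHORUS.")

# The chapter pattern finds nothing in a play, so the counter never leaves 0.
chapter <- cumsum(grepl("^chapter [\\divxlc]", lines,
                        ignore.case = TRUE, perl = TRUE))

# Assumed alternative: count lines that consist of "ACT" plus a Roman numeral.
act <- cumsum(grepl("^ACT [IVX]+$", lines))
print(data.frame(chapter, act, lines))
```

In the full download the act headings appear both in the table of contents and in the body of the play, so a counter like this would need the contents lines removed first. The chapter column is not used in the analysis that follows, so the results below are unaffected either way.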

Number of Positive and Negative Sentiment

The top words associated with negative sentiment are related to “death”. The top word associated with positive sentiment is “love”.

library(wordcloud)
library(reshape2)

word_count <- romeo_and_juliet %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE)

word_count %>%
  group_by(sentiment)%>%
  top_n(10)%>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(
    y = "Contribution to sentiment",
    x = NULL
  ) +
  coord_flip()

word_count%>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("red", "green"),
                   max.words = 100)

NRC

“Death” appears in the anger, anticipation, disgust, fear, negative, sadness, and surprise categories of the NRC lexicon.

romeo_and_juliet_nrc <- romeo_and_juliet %>% 
    inner_join(get_sentiments("nrc")) %>%
    count(word, sentiment)
romeo_and_juliet_nrc

## # A tibble: 2,238 × 3
##    word     sentiment     n
##    <chr>    <chr>     <int>
##  1 abuse    anger         1
##  2 abuse    disgust       1
##  3 abuse    fear          1
##  4 abuse    negative      1
##  5 abuse    sadness       1
##  6 accident fear          1
##  7 accident negative      1
##  8 accident sadness       1
##  9 accident surprise      1
## 10 account  trust         2
## # … with 2,228 more rows

romeo_and_juliet_nrc %>% 
  group_by(sentiment)%>%
  top_n(10)%>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(
    y = "Contribution to sentiment",
    x = NULL
  ) +
  coord_flip()
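
Because an NRC word can belong to several categories, joining a text to the full lexicon yields one row per token per category, which is why “death” shows up in so many panels. A small base-R sketch with `merge()` standing in for `inner_join()` makes this visible. The “death” rows follow the categories listed above, while the “love” rows and the token list are assumptions made for illustration.

```r
# Hypothetical tokens: "death" occurs twice, "love" once.
tokens <- data.frame(linenumber = c(1, 1, 2),
                     word = c("death", "love", "death"),
                     stringsAsFactors = FALSE)

# Toy lexicon: seven categories for "death" (as listed above), two assumed for "love".
nrc_toy <- data.frame(
  word = c(rep("death", 7), rep("love", 2)),
  sentiment = c("anger", "anticipation", "disgust", "fear",
                "negative", "sadness", "surprise",
                "joy", "positive"),
  stringsAsFactors = FALSE
)

# Each token is repeated once for every category its word belongs to.
joined <- merge(tokens, nrc_toy, by = "word")
print(table(joined$word))

# Restricting to the two polarity labels leaves one row per matched token.
polarity <- joined[joined$sentiment %in% c("positive", "negative"), ]
print(nrow(polarity))
```

This is also why the net-sentiment comparison filters NRC to “positive” and “negative” before counting.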

Net Sentiment

There is one more lexicon available through ‘get_sentiments()’ that we have not used yet: “loughran”, a lexicon mainly used with financial statements.

Compare how the sentiment changes in “Romeo and Juliet” using four sentiment dictionaries. The plot shows similar overall trends in sentiment across the play. The sentiment is negative at the end of the play, which is not surprising given that “Romeo and Juliet” has a tragic ending.

afinn1 <- romeo_and_juliet %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = linenumber %/% 80) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(method = "AFINN")

bing_and_nrc1 <- bind_rows(
  romeo_and_juliet %>% 
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  romeo_and_juliet %>% 
    inner_join(get_sentiments("nrc") %>% 
                 filter(sentiment %in% c("positive", 
                                         "negative"))
    ) %>%
    mutate(method = "NRC"),
    romeo_and_juliet %>% 
    inner_join(get_sentiments("loughran") %>% 
                 filter(sentiment %in% c("positive", 
                                         "negative"))
    ) %>%
    mutate(method = "loughran")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>% 
  mutate(sentiment = positive - negative)

bind_rows(afinn1, 
          bing_and_nrc1) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")

Top 5 Popular Books on Project Gutenberg

According to Project Gutenberg, the top 5 ebooks are: “Romeo and Juliet” by William Shakespeare, “A Room with a View” by E. M. Forster, “Middlemarch” by George Eliot, “Moby Dick; Or, The Whale” by Herman Melville, and “Little Women; Or, Meg, Jo, Beth, and Amy” by Louisa May Alcott.

library(gutenbergr)
library(DT)
gutenberg_works() %>%
  filter(title %in%  c("Romeo and Juliet","A Room with a View","Middlemarch","Moby Dick; Or, The Whale", "Little Women; Or, Meg, Jo, Beth, and Amy"))

## # A tibble: 5 × 8
##   gutenberg_id title               author guten…¹ langu…² guten…³ rights has_t…⁴
##          <int> <chr>               <chr>    <int> <chr>   <chr>   <chr>  <lgl>  
## 1          145 Middlemarch         Eliot…      90 en      Best B… Publi… TRUE   
## 2         1513 Romeo and Juliet    Shake…      65 en      <NA>    Publi… TRUE   
## 3         2489 Moby Dick; Or, The… Melvi…       9 en      Best B… Publi… TRUE   
## 4         2641 A Room with a View  Forst…     975 en      Italy   Publi… TRUE   
## 5        37106 Little Women; Or, … Alcot…     102 en      <NA>    Publi… TRUE   
## # … with abbreviated variable names ¹gutenberg_author_id, ²language,
## #   ³gutenberg_bookshelf, ⁴has_text

top_5_books <- gutenberg_download(c(1513, 2641, 145, 2489, 37106))

Tidy text

To run our sentiment analysis, we need the text in one-token-per-row format, which we get with the ‘unnest_tokens’ function from the ‘tidytext’ package.

top_5_books <- top_5_books %>%
    group_by(gutenberg_id) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, 
                                regex("^chapter [\\divxlc]", 
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

# Map each Gutenberg ID to its title by exact match
# (grep() would also match IDs that merely contain the pattern).
book_titles <- c("145"   = "Middlemarch",
                 "1513"  = "Romeo and Juliet",
                 "2489"  = "Moby Dick; Or, The Whale",
                 "2641"  = "A Room with a View",
                 "37106" = "Little Women")
top_5_books$gutenberg_id <- unname(book_titles[as.character(top_5_books$gutenberg_id)])

colnames(top_5_books)[1] <- "book"

datatable(head(top_5_books,100))

Percentage of Sentiment in Each Book

Out of all 5 books, “Little Women” has the highest percentage of positive words. “Romeo and Juliet” has the highest percentage of negative words.

top_5_books_bing <- top_5_books %>%
  inner_join(get_sentiments("bing")) %>%
  group_by(book) %>%
  count(sentiment, sort = TRUE) %>%
  ungroup()

top_5_books_bing <- top_5_books_bing  %>%
  group_by(book) %>%
  mutate(percentage =  (n/sum(n)))

library(scales)
top_5 <- top_5_books_bing 
top_5$percentage <- percent(top_5$percentage, accuracy = 1)
datatable(top_5)
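
The percentage column is each sentiment count divided by the total count for the same book, so the two values for a book always sum to 1. A base-R sketch of that grouped division is shown below. The book names and counts are hypothetical, and `ave()` stands in for `group_by()` followed by `mutate()`.

```r
# Hypothetical positive and negative word counts for two books.
counts <- data.frame(
  book      = c("Book A", "Book A", "Book B", "Book B"),
  sentiment = c("negative", "positive", "negative", "positive"),
  n         = c(70, 30, 40, 60),
  stringsAsFactors = FALSE
)

# ave() returns the per-book total for every row, so the division is done within each book.
counts$percentage <- counts$n / ave(counts$n, counts$book, FUN = sum)
print(counts)
```

Dividing within each book, and not by the grand total, is what makes books of very different lengths comparable.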

Visualization of Percentage of Sentiment in Each Book

Only two books have a higher percentage of positive words than negative words: “Middlemarch” and “Little Women”. “Little Women” has the highest percentage of positive words and the lowest percentage of negative words.

ggplot(top_5_books_bing, aes(x=book, y=percentage, fill=sentiment)) +
    geom_bar(stat='identity', position='dodge')+
  coord_flip() + scale_y_continuous(labels = scales::percent)

Sentiment Changes Over the Plot Trajectory

top_5_books_sentiment <- top_5_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)

datatable(head(top_5_books_sentiment,100))

Data Visualization of Sentiment Changes Over the Plot Trajectory

“A Room with a View” starts with negative sentiment and ends with negative sentiment.

“Little Women” contains mostly positive sentiment throughout the plot.

“Romeo and Juliet” starts with negative sentiment and ends with negative sentiment. This is not surprising, as Romeo and Juliet face many obstacles and the play has a tragic ending.

“Middlemarch” starts with positive sentiment and ends with positive sentiment.

“Moby Dick” ends with negative sentiment.

ggplot(top_5_books_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")

Financial News

The “loughran” lexicon, also available through ‘get_sentiments()’, is mainly used with financial statements, so here we apply it to a financial corpus.

On Kaggle, I found a dataset that contains financial news headlines.

Tidy text

To run our sentiment analysis, we need the text in one-token-per-row format, which we get with the ‘unnest_tokens’ function from the ‘tidytext’ package.

library(tidyr)
library(dplyr)
library(stringr)
raw_financial <- read.delim(file = "https://raw.githubusercontent.com/suswong/Data-607-Assignments/main/all-data.csv", header = FALSE, sep = ",")

datatable(raw_financial)

colnames(raw_financial) <- c("sentiment","text")

tidy_financial <- raw_financial[-1] %>%
  mutate(
    linenumber = row_number()) %>%
  ungroup() %>%
  unnest_tokens(word, text)
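
The steps above (read a headerless two-column CSV, name the columns, number the headlines, split into words) can be tried on a small inline sample without downloading anything. In the sketch below the two headlines are invented for illustration, and `strsplit()` on non-alphanumeric characters is only a rough stand-in for `unnest_tokens()`.

```r
# Two made-up rows in the same layout as the dataset: label first, headline second.
csv_text <- paste(
  'neutral,"Company X opens a new plant, adding 50 jobs"',
  'negative,"Operating profit fell from the previous year"',
  sep = "\n"
)

# header = FALSE because the file has no header row; col.names supplies the names.
raw <- read.csv(text = csv_text, header = FALSE,
                col.names = c("sentiment", "text"),
                stringsAsFactors = FALSE)
raw$linenumber <- seq_len(nrow(raw))

# Lowercase each headline and split it on runs of non-alphanumeric characters.
words <- strsplit(tolower(raw$text), "[^a-z0-9]+")

# One row per token, keeping track of the headline it came from.
tidy <- data.frame(linenumber = rep(raw$linenumber, lengths(words)),
                   word = unlist(words),
                   stringsAsFactors = FALSE)
print(head(tidy))
```

The quoted second field keeps the comma inside the first headline from being read as a column separator.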

Sentiment Analysis

get_sentiments("loughran")

## # A tibble: 4,150 × 2
##    word         sentiment
##    <chr>        <chr>    
##  1 abandon      negative 
##  2 abandoned    negative 
##  3 abandoning   negative 
##  4 abandonment  negative 
##  5 abandonments negative 
##  6 abandons     negative 
##  7 abdicated    negative 
##  8 abdicates    negative 
##  9 abdicating   negative 
## 10 abdication   negative 
## # … with 4,140 more rows

financial_sentiment <- tidy_financial %>%
  inner_join(get_sentiments("loughran")) %>%
  count(word, index = linenumber, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0)

g <- tidy_financial %>%
  inner_join(get_sentiments("loughran")) %>%
  count(sentiment) 

ggplot(g, aes(x= reorder(sentiment, n), y=n)) +
  geom_bar(stat="identity") + coord_flip()

word_count_financial <- tidy_financial %>%
  inner_join(get_sentiments("loughran")) %>%
  count(word, sentiment, sort = TRUE)

datatable(word_count_financial)

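# Keep only the two polarities so each gets exactly one color (red = negative, green = positive);
# the loughran lexicon has more than two categories.
word_count_financial %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("red", "green"),
                   max.words = 100)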

Conclusion

Romeo and Juliet

The plot of net sentiment over the course of the play shows periods of rising and falling sentiment, but the play ultimately ends with negative sentiment. This is not surprising, as the star-crossed lovers meet a tragic end.

Top 5 Ebooks on Gutenberg

According to Project Gutenberg, the top 5 ebooks are:

“Romeo and Juliet” by William Shakespeare
“A Room with a View” by E. M. Forster
“Middlemarch” by George Eliot
“Moby Dick; Or, The Whale” by Herman Melville
“Little Women; Or, Meg, Jo, Beth, and Amy” by Louisa May Alcott.

Out of all 5 books, “Little Women” has the highest percentage of positive words and “Romeo and Juliet” has the highest percentage of negative words. Both “A Room with a View” and “Romeo and Juliet” start and end with negative sentiment. “Middlemarch” starts and ends with positive sentiment. “Moby Dick” ends with negative sentiment.

Source

Silge, J., & Robinson, D. (2017). Text mining with R: A tidy approach. O’Reilly.
Loughran-McDonald sentiment lexicon — lexicon_loughran. (n.d.). https://emilhvitfeldt.github.io/textdata/reference/lexicon_loughran.html
Sentiment Analysis for Financial News. (2020, May 27). Kaggle. https://www.kaggle.com/datasets/ankurzing/sentiment-analysis-for-financial-news
Project Gutenberg. (n.d.). Project Gutenberg. https://www.gutenberg.org/

DATA 607 Sentiment Analysis

Susanna Wong

2023-03-29
