Assignment Prompt

In Text Mining with R, Chapter 2 looks at Sentiment Analysis. In this assignment, you should start by getting the primary example code from chapter 2 working in an R Markdown document. You should provide a citation to this base code. You’re then asked to extend the code in two ways:

  • Work with a different corpus of your choosing, and

  • Incorporate at least one additional sentiment lexicon (possibly from another R package that you’ve found through research).

Text Mining with R

Chapter 2 of Text Mining with R covers sentiment analysis. The code in this section reproduces the chapter’s primary example (Silge & Robinson, 2017).

Load Lexicons

We examine three general-purpose sentiment lexicons from the ‘textdata’ package. All three are based on unigrams: AFINN assigns each word a score between -5 and 5, Bing labels words as positive or negative, and NRC assigns words to one or more of ten categories such as joy, fear, and trust.

library(tidytext)
library(textdata)

get_sentiments("afinn")
## # A tibble: 2,477 × 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # … with 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # … with 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,872 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # … with 13,862 more rows
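The lexicons label words in different ways. As a quick comparison, we can summarise the label set of each one; this is a minimal sketch (‘dplyr’ is loaded here for ‘count()’ and ‘summarise()’, since it is not attached until the next section).

library(dplyr)

# AFINN assigns numeric scores, so summarise its range;
# Bing and NRC assign categorical labels, so count them
get_sentiments("afinn") %>% summarise(min_score = min(value), max_score = max(value))
get_sentiments("bing") %>% count(sentiment)
get_sentiments("nrc") %>% count(sentiment, sort = TRUE)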

Sentiment Analysis of 6 Jane Austen Books

Load corpus in R

The ‘janeaustenr’ package contains the text of Jane Austen’s 6 completed novels: “Sense & Sensibility”, “Pride & Prejudice”, “Mansfield Park”, “Emma”, “Northanger Abbey”, and “Persuasion”.

library(janeaustenr)
library(dplyr)
library(stringr)

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, 
                                regex("^chapter [\\divxlc]", 
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

knitr::kable(head(tidy_books, 50), caption = "This table contains the first 50 observations of the tidy_books data frame.")
This table contains the first 50 observations of the tidy_books data frame.
book linenumber chapter word
Sense & Sensibility 1 0 sense
Sense & Sensibility 1 0 and
Sense & Sensibility 1 0 sensibility
Sense & Sensibility 3 0 by
Sense & Sensibility 3 0 jane
Sense & Sensibility 3 0 austen
Sense & Sensibility 5 0 1811
Sense & Sensibility 10 1 chapter
Sense & Sensibility 10 1 1
Sense & Sensibility 13 1 the
Sense & Sensibility 13 1 family
Sense & Sensibility 13 1 of
Sense & Sensibility 13 1 dashwood
Sense & Sensibility 13 1 had
Sense & Sensibility 13 1 long
Sense & Sensibility 13 1 been
Sense & Sensibility 13 1 settled
Sense & Sensibility 13 1 in
Sense & Sensibility 13 1 sussex
Sense & Sensibility 13 1 their
Sense & Sensibility 13 1 estate
Sense & Sensibility 14 1 was
Sense & Sensibility 14 1 large
Sense & Sensibility 14 1 and
Sense & Sensibility 14 1 their
Sense & Sensibility 14 1 residence
Sense & Sensibility 14 1 was
Sense & Sensibility 14 1 at
Sense & Sensibility 14 1 norland
Sense & Sensibility 14 1 park
Sense & Sensibility 14 1 in
Sense & Sensibility 14 1 the
Sense & Sensibility 14 1 centre
Sense & Sensibility 14 1 of
Sense & Sensibility 15 1 their
Sense & Sensibility 15 1 property
Sense & Sensibility 15 1 where
Sense & Sensibility 15 1 for
Sense & Sensibility 15 1 many
Sense & Sensibility 15 1 generations
Sense & Sensibility 15 1 they
Sense & Sensibility 15 1 had
Sense & Sensibility 15 1 lived
Sense & Sensibility 15 1 in
Sense & Sensibility 15 1 so
Sense & Sensibility 16 1 respectable
Sense & Sensibility 16 1 a
Sense & Sensibility 16 1 manner
Sense & Sensibility 16 1 as
Sense & Sensibility 16 1 to

Filter for Joy Words from “Emma”

Use the ‘filter’ function from the ‘dplyr’ package to keep only the joy words from the ‘nrc’ lexicon and only the text from the book “Emma”, then inner join the two data frames and count the most common joy words.

library(DT)
nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

Emma <- tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)

datatable(Emma)

Number of Positive and Negative Words for Each Book

Using the ‘bing’ lexicon, count the number of positive and negative words in each 80-line section of every book, then calculate the net sentiment (positive - negative). The integer division ‘linenumber %/% 80’ assigns each line to an 80-line chunk, which serves as an index of narrative time.

library(tidyr)

jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)

datatable(jane_austen_sentiment)

Data Visualization of Sentiment Changes Over the Plot Trajectory

The net sentiment is plotted against narrative time (the index on the x-axis), so we can see how sentiment changes over each book’s plot trajectory.

library(ggplot2)

ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")

Comparing the Three Sentiment Dictionaries

Compare how the sentiment changes in Jane Austen’s “Pride & Prejudice” using the three sentiment dictionaries. The plot shows similar overall trends in sentiment across the novel.

pride_prejudice <- tidy_books %>% 
  filter(book == "Pride & Prejudice")

pride_prejudice 
## # A tibble: 122,204 × 4
##    book              linenumber chapter word     
##    <fct>                  <int>   <int> <chr>    
##  1 Pride & Prejudice          1       0 pride    
##  2 Pride & Prejudice          1       0 and      
##  3 Pride & Prejudice          1       0 prejudice
##  4 Pride & Prejudice          3       0 by       
##  5 Pride & Prejudice          3       0 jane     
##  6 Pride & Prejudice          3       0 austen   
##  7 Pride & Prejudice          7       1 chapter  
##  8 Pride & Prejudice          7       1 1        
##  9 Pride & Prejudice         10       1 it       
## 10 Pride & Prejudice         10       1 is       
## # … with 122,194 more rows
afinn <- pride_prejudice %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = linenumber %/% 80) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(method = "AFINN")

bing_and_nrc <- bind_rows(
  pride_prejudice %>% 
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  pride_prejudice %>% 
    inner_join(get_sentiments("nrc") %>% 
                 filter(sentiment %in% c("positive", 
                                         "negative"))
    ) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>% 
  mutate(sentiment = positive - negative)

bind_rows(afinn, 
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")

Extended Practice

Extend the code in two ways:

  • Work with a different corpus of your choosing, and

  • Incorporate at least one additional sentiment lexicon (possibly from another R package that you’ve found through research).

Romeo and Juliet

Download the Text from Gutenberg

Based on the Gutenberg website, the most popular book is “Romeo and Juliet”, so we will run sentiment analysis on it.

We need the book’s ID number to download its text into R. The ‘gutenberg_works()’ function returns a table of Gutenberg metadata; use ‘filter()’ to keep only the row for “Romeo and Juliet”, then pass the ID to ‘gutenberg_download()’ to download the text.

library(gutenbergr)
library(DT)
gutenberg_works() %>%
  filter(title == "Romeo and Juliet")
## # A tibble: 1 × 8
##   gutenberg_id title            author    guten…¹ langu…² guten…³ rights has_t…⁴
##          <int> <chr>            <chr>       <int> <chr>   <chr>   <chr>  <lgl>  
## 1         1513 Romeo and Juliet Shakespe…      65 en      <NA>    Publi… TRUE   
## # … with abbreviated variable names ¹​gutenberg_author_id, ²​language,
## #   ³​gutenberg_bookshelf, ⁴​has_text
romeo_and_juliet <- gutenberg_download(1513)

knitr::kable(head(romeo_and_juliet, 50), caption = "This table contains the first 50 lines of 'Romeo and Juliet'.")
This table contains the first 50 lines of ‘Romeo and Juliet’.
gutenberg_id text
1513 THE TRAGEDY OF ROMEO AND JULIET
1513
1513
1513
1513 by William Shakespeare
1513
1513
1513 Contents
1513
1513 THE PROLOGUE.
1513
1513 ACT I
1513 Scene I. A public place.
1513 Scene II. A Street.
1513 Scene III. Room in Capulet’s House.
1513 Scene IV. A Street.
1513 Scene V. A Hall in Capulet’s House.
1513
1513
1513 ACT II
1513 CHORUS.
1513 Scene I. An open place adjoining Capulet’s Garden.
1513 Scene II. Capulet’s Garden.
1513 Scene III. Friar Lawrence’s Cell.
1513 Scene IV. A Street.
1513 Scene V. Capulet’s Garden.
1513 Scene VI. Friar Lawrence’s Cell.
1513
1513
1513 ACT III
1513 Scene I. A public Place.
1513 Scene II. A Room in Capulet’s House.
1513 Scene III. Friar Lawrence’s cell.
1513 Scene IV. A Room in Capulet’s House.
1513 Scene V. An open Gallery to Juliet’s Chamber, overlooking the Garden.
1513
1513
1513 ACT IV
1513 Scene I. Friar Lawrence’s Cell.
1513 Scene II. Hall in Capulet’s House.
1513 Scene III. Juliet’s Chamber.
1513 Scene IV. Hall in Capulet’s House.
1513 Scene V. Juliet’s Chamber; Juliet on the bed.
1513
1513
1513 ACT V
1513 Scene I. Mantua. A Street.
1513 Scene II. Friar Lawrence’s Cell.
1513 Scene III. A churchyard; in it a Monument belonging to the Capulets.
1513

Tidy text

To run the sentiment analysis, we need the text in a one-token-per-row format, which the ‘unnest_tokens’ function from the ‘tidytext’ package produces.

romeo_and_juliet <- romeo_and_juliet[c("text")] %>%
  mutate(
    linenumber = row_number(),
    # the play is divided into acts rather than chapters, so track those instead
    act = cumsum(str_detect(text, 
                            regex("^act [\\divxlc]", 
                                  ignore_case = TRUE)))) %>%
  unnest_tokens(word, text)

datatable(head(romeo_and_juliet,100))

Number of Positive and Negative Words

The top words associated with negative sentiment are related to “death”. The top word associated with positive sentiment is “love”.

library(wordcloud)
library(reshape2)

word_count <- romeo_and_juliet %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE)

word_count %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(
    y = "Contribution to sentiment",
    x = NULL
  ) +
  coord_flip()

word_count %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("red", "green"),
                   max.words = 100)

NRC

The word “death” appears in the anger, anticipation, disgust, fear, negative, sadness, and surprise categories of the NRC lexicon.
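As a quick check of this claim, we can look up the word “death” in the NRC lexicon (a minimal sketch):

# which NRC categories include the word "death"?
get_sentiments("nrc") %>% 
  filter(word == "death")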

romeo_and_juliet_nrc <- romeo_and_juliet %>% 
    inner_join(get_sentiments("nrc")) %>%
    count(word, sentiment)
romeo_and_juliet_nrc
## # A tibble: 2,238 × 3
##    word     sentiment     n
##    <chr>    <chr>     <int>
##  1 abuse    anger         1
##  2 abuse    disgust       1
##  3 abuse    fear          1
##  4 abuse    negative      1
##  5 abuse    sadness       1
##  6 accident fear          1
##  7 accident negative      1
##  8 accident sadness       1
##  9 accident surprise      1
## 10 account  trust         2
## # … with 2,228 more rows
romeo_and_juliet_nrc %>% 
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(
    y = "Contribution to sentiment",
    x = NULL
  ) +
  coord_flip()

Net Sentiment

The ‘textdata’ package provides one more lexicon we have not yet used: ‘loughran’, which is designed mainly for financial text. We include it here as a fourth dictionary for comparison.

Compare how the sentiment changes in “Romeo and Juliet” using the four sentiment dictionaries. The plot shows similar overall trends in sentiment across the play, with clearly negative sentiment at the end. This is not surprising, as Romeo and Juliet has a tragic ending.

afinn1 <- romeo_and_juliet %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = linenumber %/% 80) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(method = "AFINN")

bing_and_nrc1 <- bind_rows(
  romeo_and_juliet %>% 
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  romeo_and_juliet %>% 
    inner_join(get_sentiments("nrc") %>% 
                 filter(sentiment %in% c("positive", 
                                         "negative"))
    ) %>%
    mutate(method = "NRC"),
    romeo_and_juliet %>% 
    inner_join(get_sentiments("loughran") %>% 
                 filter(sentiment %in% c("positive", 
                                         "negative"))
    ) %>%
    mutate(method = "loughran")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>% 
  mutate(sentiment = positive - negative)

bind_rows(afinn1, 
          bing_and_nrc1) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")

Financial News

The ‘loughran’ lexicon is designed primarily for financial text, so a corpus of financial news is a natural fit. On Kaggle, I found a dataset that contains financial news headlines.

Tidy text

As before, we convert the headlines to a one-token-per-row format with the ‘unnest_tokens’ function from the ‘tidytext’ package.

library(tidyr)
library(dplyr)
library(stringr)
raw_financial <- read.delim(file = "https://raw.githubusercontent.com/suswong/Data-607-Assignments/main/all-data.csv", header = FALSE, sep = ",")

colnames(raw_financial) <- c("sentiment", "text")
datatable(raw_financial)

# drop the labeled sentiment column and keep only the headline text,
# then tokenize to one word per row
tidy_financial <- raw_financial[-1] %>%
  mutate(linenumber = row_number()) %>%
  unnest_tokens(word, text)

Sentiment Analysis

The ‘loughran’ lexicon assigns words to six categories: negative, positive, uncertainty, litigious, constraining, and superfluous.

get_sentiments("loughran")
## # A tibble: 4,150 × 2
##    word         sentiment
##    <chr>        <chr>    
##  1 abandon      negative 
##  2 abandoned    negative 
##  3 abandoning   negative 
##  4 abandonment  negative 
##  5 abandonments negative 
##  6 abandons     negative 
##  7 abdicated    negative 
##  8 abdicates    negative 
##  9 abdicating   negative 
## 10 abdication   negative 
## # … with 4,140 more rows
financial_sentiment <- tidy_financial %>%
  inner_join(get_sentiments("loughran")) %>%
  count(word, index = linenumber, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0)

g <- tidy_financial %>%
  inner_join(get_sentiments("loughran")) %>%
  count(sentiment) 

ggplot(g, aes(x = reorder(sentiment, n), y = n)) +
  geom_col() +
  coord_flip()

word_count_financial <- tidy_financial %>%
  inner_join(get_sentiments("loughran")) %>%
  count(word, sentiment, sort = TRUE)

datatable(word_count_financial)
word_count_financial %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("red", "green"),
                   max.words = 100)

Conclusion

Romeo and Juliet

The visualization of net sentiment over the plot shows alternating stretches of positive and negative sentiment, but the story ultimately ends on a negative note. This is not surprising, as the star-crossed lovers meet a tragic end.
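As a rough check that the play ends on a negative note, we can look at the net Bing sentiment for its final 80-line chunks, reusing the ‘bing_and_nrc1’ data frame computed above (a minimal sketch):

# net Bing sentiment for the last few chunks of the play
bing_and_nrc1 %>%
  filter(method == "Bing et al.") %>%
  arrange(desc(index)) %>%
  head(5)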

Top 5 Ebooks on Gutenberg

According to Project Gutenberg, the top 5 ebooks are:

  • “Romeo and Juliet” by William Shakespeare

  • “A Room with a View” by E. M. Forster

  • “Middlemarch” by George Eliot

  • “Moby Dick; Or, The Whale” by Herman Melville

  • “Little Women; Or, Meg, Jo, Beth, and Amy” by Louisa May Alcott.

Of all 5 books, “Little Women” has the highest percentage of positive words and “Romeo and Juliet” has the highest percentage of negative words. Both “A Room with a View” and “Romeo and Juliet” begin and end with negative sentiment, while “Middlemarch” begins and ends with positive sentiment. “Moby Dick” ends with negative sentiment.
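The comparison above was run separately and is not reproduced here. Below is a minimal sketch of how the positive/negative percentages could be computed with the ‘bing’ lexicon; the title strings are assumptions and would need to match the ‘gutenberg_works()’ metadata exactly.

# sketch: share of positive vs. negative words in the top 5 Gutenberg ebooks
# (the titles below are assumptions; they must match gutenberg_works() metadata)
top5_titles <- c("Romeo and Juliet",
                 "A Room with a View",
                 "Middlemarch",
                 "Moby Dick; Or, The Whale",
                 "Little Women; Or, Meg, Jo, Beth, and Amy")

top5_books <- gutenberg_works() %>%
  filter(title %in% top5_titles) %>%
  gutenberg_download(meta_fields = "title")

top5_books %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing")) %>%
  count(title, sentiment) %>%
  group_by(title) %>%
  mutate(percent = round(100 * n / sum(n), 1))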

Source

  1. Silge, J., & Robinson, D. (2017). Text Mining with R: A Tidy Approach. O’Reilly Media.

  2. Loughran-McDonald sentiment lexicon — lexicon_loughran. (n.d.). https://emilhvitfeldt.github.io/textdata/reference/lexicon_loughran.html

  3. Sentiment Analysis for Financial News. (2020, May 27). Kaggle. https://www.kaggle.com/datasets/ankurzing/sentiment-analysis-for-financial-news

  4. Project Gutenberg. (n.d.). Project Gutenberg. https://www.gutenberg.org/