Assignment 10A: Sentiment Analysis

Author

Emily El Mouaquite

Approach

Reproduce the sentiment analysis found in chapter 2 of Text Mining with R.
Extend the analysis using this dataset containing quotes from bestselling books about happiness.
Utilize the Syuzhet lexicon for further analysis.

Code Base

Reproduction of Base Example

Example code from chapter 2 of Text Mining with R:

library(tidytext)

get_sentiments("afinn")

# A tibble: 2,477 × 2
   word       value
   <chr>      <dbl>
 1 abandon       -2
 2 abandoned     -2
 3 abandons      -2
 4 abducted      -2
 5 abduction     -2
 6 abductions    -2
 7 abhor         -3
 8 abhorred      -3
 9 abhorrent     -3
10 abhors        -3
# ℹ 2,467 more rows

get_sentiments("bing")

# A tibble: 6,786 × 2
   word        sentiment
   <chr>       <chr>    
 1 2-faces     negative 
 2 abnormal    negative 
 3 abolish     negative 
 4 abominable  negative 
 5 abominably  negative 
 6 abominate   negative 
 7 abomination negative 
 8 abort       negative 
 9 aborted     negative 
10 aborts      negative 
# ℹ 6,776 more rows

get_sentiments("nrc")

# A tibble: 13,872 × 2
   word        sentiment
   <chr>       <chr>    
 1 abacus      trust    
 2 abandon     fear     
 3 abandon     negative 
 4 abandon     sadness  
 5 abandoned   anger    
 6 abandoned   fear     
 7 abandoned   negative 
 8 abandoned   sadness  
 9 abandonment anger    
10 abandonment fear     
# ℹ 13,862 more rows

library(janeaustenr)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(stringr)

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, 
                                regex("^chapter [\\divxlc]", 
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)

Joining with `by = join_by(word)`

# A tibble: 301 × 2
   word          n
   <chr>     <int>
 1 good        359
 2 friend      166
 3 hope        143
 4 happy       125
 5 love        117
 6 deal         92
 7 found        92
 8 present      89
 9 kind         82
10 happiness    76
# ℹ 291 more rows

library(tidyr)

jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)

Joining with `by = join_by(word)`

Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 435434 of `x` matches multiple rows in `y`.
ℹ Row 5051 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
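The many-to-many warning above typically arises when a join key is repeated on both sides of the join. One plausible cause here, stated as an assumption rather than something the output confirms, is that a word appears in the lexicon more than once with different labels. The sketch below uses made-up toy tables (toy_lexicon and toy_tokens are hypothetical names, not part of the original analysis) to show how repeated lexicon words can be listed and how the relationship argument named in the warning declares the duplication as expected.

```r
library(dplyr)

# Toy lexicon with hypothetical entries: "envious" carries two labels
toy_lexicon <- tibble(
  word = c("happy", "envious", "envious", "sad"),
  sentiment = c("positive", "positive", "negative", "negative")
)

# Toy token table: "envious" also occurs twice in the text
toy_tokens <- tibble(word = c("happy", "envious", "sad", "envious"))

# List the words that appear more than once in the lexicon
dup_words <- toy_lexicon %>%
  count(word) %>%
  filter(n > 1)

# Declaring the relationship keeps the same rows but silences the warning
joined <- toy_tokens %>%
  inner_join(toy_lexicon, by = "word", relationship = "many-to-many")

dup_words
nrow(joined)
```

The same count-and-filter check could be run on get_sentiments("bing") to see which words, if any, are repeated there.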

library(ggplot2)

ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")
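The index = linenumber %/% 80 step in the code above uses integer division to assign each line to an 80-line section, so sentiment is tallied per section rather than per line. A minimal base R sketch with made-up line numbers shows the mapping:

```r
# Hypothetical line numbers, chosen to straddle section boundaries
linenumber <- c(1, 79, 80, 159, 160, 161)

# Integer division: lines 0-79 fall in section 0, 80-159 in section 1, and so on
index <- linenumber %/% 80

data.frame(linenumber, index)
```

A smaller divisor gives finer sections with noisier scores, while a larger one smooths the trajectory at the cost of detail.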

pride_prejudice <- tidy_books %>% 
  filter(book == "Pride & Prejudice")

pride_prejudice

# A tibble: 122,204 × 4
   book              linenumber chapter word     
   <fct>                  <int>   <int> <chr>    
 1 Pride & Prejudice          1       0 pride    
 2 Pride & Prejudice          1       0 and      
 3 Pride & Prejudice          1       0 prejudice
 4 Pride & Prejudice          3       0 by       
 5 Pride & Prejudice          3       0 jane     
 6 Pride & Prejudice          3       0 austen   
 7 Pride & Prejudice          7       1 chapter  
 8 Pride & Prejudice          7       1 1        
 9 Pride & Prejudice         10       1 it       
10 Pride & Prejudice         10       1 is       
# ℹ 122,194 more rows

afinn <- pride_prejudice %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = linenumber %/% 80) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(method = "AFINN")

Joining with `by = join_by(word)`

bing_and_nrc <- bind_rows(
  pride_prejudice %>% 
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  pride_prejudice %>% 
    inner_join(get_sentiments("nrc") %>% 
                 filter(sentiment %in% c("positive", 
                                         "negative"))
    ) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>% 
  mutate(sentiment = positive - negative)

Joining with `by = join_by(word)`
Joining with `by = join_by(word)`

Warning in inner_join(., get_sentiments("nrc") %>% filter(sentiment %in% : Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 215 of `x` matches multiple rows in `y`.
ℹ Row 5178 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.

bind_rows(afinn, 
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")

get_sentiments("nrc") %>% 
  filter(sentiment %in% c("positive", "negative")) %>% 
  count(sentiment)

# A tibble: 2 × 2
  sentiment     n
  <chr>     <int>
1 negative   3316
2 positive   2308

get_sentiments("bing") %>% 
  count(sentiment)

# A tibble: 2 × 2
  sentiment     n
  <chr>     <int>
1 negative   4781
2 positive   2005

bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

Joining with `by = join_by(word)`

Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 435434 of `x` matches multiple rows in `y`.
ℹ Row 5051 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.

bing_word_counts

# A tibble: 2,585 × 3
   word     sentiment     n
   <chr>    <chr>     <int>
 1 miss     negative   1855
 2 well     positive   1523
 3 good     positive   1380
 4 great    positive    981
 5 like     positive    725
 6 better   positive    639
 7 enough   positive    613
 8 happy    positive    534
 9 love     positive    495
10 pleasure positive    462
# ℹ 2,575 more rows

bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>% 
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)

custom_stop_words <- bind_rows(tibble(word = c("miss"),  
                                      lexicon = c("custom")), 
                               stop_words)

custom_stop_words

# A tibble: 1,150 × 2
   word        lexicon
   <chr>       <chr>  
 1 miss        custom 
 2 a           SMART  
 3 a's         SMART  
 4 able        SMART  
 5 about       SMART  
 6 above       SMART  
 7 according   SMART  
 8 accordingly SMART  
 9 across      SMART  
10 actually    SMART  
# ℹ 1,140 more rows

library(wordcloud)

Loading required package: RColorBrewer

tidy_books %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))

Joining with `by = join_by(word)`

library(reshape2)


Attaching package: 'reshape2'

The following object is masked from 'package:tidyr':

    smiths

tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)

Joining with `by = join_by(word)`

Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 435434 of `x` matches multiple rows in `y`.
ℹ Row 5051 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.

p_and_p_sentences <- tibble(text = prideprejudice) %>% 
  unnest_tokens(sentence, text, token = "sentences")

p_and_p_sentences$sentence[2]

[1] "by jane austen"

austen_chapters <- austen_books() %>%
  group_by(book) %>%
  unnest_tokens(chapter, text, token = "regex", 
                pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
  ungroup()

austen_chapters %>% 
  group_by(book) %>% 
  summarise(chapters = n())

# A tibble: 6 × 2
  book                chapters
  <fct>                  <int>
1 Sense & Sensibility       51
2 Pride & Prejudice         62
3 Mansfield Park            49
4 Emma                      56
5 Northanger Abbey          32
6 Persuasion                25

bingnegative <- get_sentiments("bing") %>% 
  filter(sentiment == "negative")

wordcounts <- tidy_books %>%
  group_by(book, chapter) %>%
  summarize(words = n())

`summarise()` has grouped output by 'book'. You can override using the
`.groups` argument.

tidy_books %>%
  semi_join(bingnegative) %>%
  group_by(book, chapter) %>%
  summarize(negativewords = n()) %>%
  left_join(wordcounts, by = c("book", "chapter")) %>%
  mutate(ratio = negativewords/words) %>%
  filter(chapter != 0) %>%
  slice_max(ratio, n = 1) %>% 
  ungroup()

Joining with `by = join_by(word)`
`summarise()` has grouped output by 'book'. You can override using the
`.groups` argument.

# A tibble: 6 × 5
  book                chapter negativewords words  ratio
  <fct>                 <int>         <int> <int>  <dbl>
1 Sense & Sensibility      43           161  3405 0.0473
2 Pride & Prejudice        34           111  2104 0.0528
3 Mansfield Park           46           173  3685 0.0469
4 Emma                     15           151  3340 0.0452
5 Northanger Abbey         21           149  2982 0.0500
6 Persuasion                4            62  1807 0.0343

Works Cited

Silge, Julia, and David Robinson. “Chapter 2: Sentiment Analysis with Tidy Data.” Text Mining with R: A Tidy Approach, O’Reilly Media, Inc., 2017.

Analysis Extension

To extend this analysis, we apply sentiment analysis to the Kaggle dataset Bestseller Happiness Books 📚 Reviews, Quotes, which contains quotes from books about happiness.

#read csv
happiness_books <- read.csv("data.csv")
head(happiness_books)

                      name         authors
1          Solve For Happy       Mo Gawdat
2   Stumbling On Happiness     Dan Gilbert
3  The Happiness Advantage     Shawn Achor
4 The Happiness Hypothesis  Jonathan Haidt
5                 Flourish Martin Seligman
6          The Power Of No  James Altucher
                                                                                                                                      favorite_quote
1 If you can afford the brain cycles to worry about the future- then by definition you have nothing to worry about right now. Right now- you're okay
2                           The secret of happiness is variety- but the secret of variety- like the secret of all spices- is knowing when to use it.
3                                                         I could care less about whether it's half full or half empty - as long as I can fill it up
4                                                                                 Love and work are to people what water and sunshine are to plants.
5 I'm trying to broaden the scope of positive psychology well beyond the smiley face. Happiness is just one-fifth of what human beings choose to do.
6                                                                           When you get in the mud with a pig- you get dirty and the pig gets happy
                                                                                                                                                                                                                                                  One_line_review
1                                                                             Solve For Happy lays out a former Google engineers formula for happiness- which shows you that it's our default state and how to overcome the obstacles we face in remaining in it.
2 Stumbling On Happiness examines the capacity of our brains to fill in gaps and simulate experiences- shows how our lack of awareness of these powers sometimes leads us to wrong decisions- and how we can change our behavior to synthesize our own happiness.
3                                                                    The Happiness Advantage turns the tables on happiness by proving it is a tool for success rather than of the result of it- sharing seven actionable principles you can use to increase both.
4                                              The Happiness Hypothesis is the most thorough analysis of how you can find happiness in our modern society- backed by plenty of scientific research- real-life examples- and even a literal formula for happiness.
5                                                                              Flourish establishes a new model for well-being rooted in positive psychology- building on five key pillars to help you create a happy life through the power of simple exercises.
6                                                             The Power Of No is an encompassing instruction manual on using the power of a little word to get healthy- rid yourself of bad relationships- embrace abundance- and ultimately say yes to yourself.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          why_should_read
1 Sometimes- it takes a tragedy to understand happiness is a choice. Mo Gawdat knows. He lost his 21-year-old son Ali. He taught himself to choose happiness instead of sadness regardless. What made Gawdat's choice an obvious one was the formula that he and Ali had been working on for years: "Happiness is equal to or greater than the events of your life minus your expectation of how life should be."This incredible book shows you why your perspective- more than anything else- determines your happiness.
2                                                                                                                                             This book examines how your brain tries to lie to you- specifically about what will happen in the future. Dan Gilbert's years of research show just how our minds trick us into worrying- which makes us unhappy with our decisions even before we make them. It turns out that a big key to happiness is figuring out how to tell the difference between fact and fiction!
3                                                                                                                                                                                                                                 Shawn Achor's research reveals the lies in the conventional idea that hard work and success lead to happiness. He's identified- with science- that happiness comes first- then you will become successful. This book points to several ways that you can start being happier right now.
4                                                                                                                                This book dives into the neurological aspects that contribute to happiness with a twist. Instead of getting lost in medical terms- Haidt employs the memorable analogy of a rider on an elephant. The metaphor shows how we can harness our brains to make us happy. More importantly- you'll learn how to build thinking and relationship habits that will lead to long-term happiness.
5                                                                                                                                                    Martin Seligman is the father of positive psychology. Prior to his work- brain science was based solely on the problems with the mind. Seligman changed that with his research. He is one of the best sources for beating dysfunctional thinking patterns. This book stands out with simple but powerful exercises you can do immediately to improve your happiness.
6                                                                                                                                                        Ultimately- this book is not about saying no- although you'll get a lot of tips on how to do that. The benefit of this book lies in learning to eliminate unnecessary things from your life so that you can say yes to yourself. It's packed with practical tips for ridding your life of that which pulls you down- which will make you feel freer and happier.
                                                                               key_takeaway_1
1                                                       Your inner voice is not the real you.
2                  Your brain is really bad at filling in the blanks- but it keeps on trying.
3                                               Happiness comes before success- not after it.
4 Surround yourself with the people you love the most and live in accordance with reciprocity
5                               A life of profound fulfillment is built on the acronym PERMA.
6                                           Rate your regulars to say no to the wrong people.
                                                              key_takeaway_2
1 Many cognitive filters prevent you from seeing the whole world around you.
2         You should always compare products based on value- never on price.
3          You can train yourself to be optimistic with the "Tetris Effect."
4                                               Do work that matters to you.
5                Simple positivity exercises can have life-changing effects.
6         Stop doing things you don't like- and everyone will be better off.
                                                                                        key_takeaway_3
1 No matter if life is good or bad- staying in the present always makes you feel more content with it.
2                                                      Bad experiences are better than no experiences.
3                                                                             Fall up instead of down.
4                                Find a partner who will stand by your side through sunshine and rain.
5                   IQ isn't everything - success is based on character traits- not just intelligence.
6                                                   Say no to scarcity to go beyond "glass half full."

We will use the favorite_quote column to perform the sentiment analysis with the Syuzhet lexicon. Information on using this package, which informed the following sentiment analysis, can be found here.

library(syuzhet)
# Compute one sentiment score per quote using the Syuzhet lexicon
syuzhet_scores <- get_sentiment(happiness_books$favorite_quote,
                                method = "syuzhet")
# Add the sentiment scores to the dataset as a new column
happiness_books$sentiment <- syuzhet_scores
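To make the idea of one score per quote concrete, here is a minimal sketch of lexicon-based scoring in base R. It is not the syuzhet implementation: toy_lexicon, its values, and score_text() are hypothetical and exist only to illustrate how word-level values can be summed into a single number per string.

```r
# Hypothetical word values; the real Syuzhet lexicon is far larger
toy_lexicon <- c(happy = 0.75, love = 0.75, worry = -0.75, dirty = -0.5)

# Sum the lexicon values of the words in each text; unknown words count as 0
score_text <- function(texts, lexicon) {
  vapply(texts, function(txt) {
    words <- strsplit(tolower(txt), "[^a-z']+")[[1]]
    sum(lexicon[words], na.rm = TRUE)
  }, numeric(1), USE.NAMES = FALSE)
}

quotes <- c("The pig gets happy and you get dirty",
            "Nothing to worry about",
            "Love and work")

toy_scores <- score_text(quotes, toy_lexicon)
toy_scores
```

Because matching values are summed, longer quotes with many matching words can reach larger absolute scores than short ones.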

ggplot(happiness_books, aes(x = sentiment)) +
  geom_histogram(bins = 30) +
  labs(title = "Distribution of Sentiment Scores (Syuzhet)",
       x = "Sentiment Score",
       y = "Count")

Sentiment scores from the Syuzhet lexicon are signed: a lower score indicates a more negative sentiment, and a higher score indicates a more positive sentiment. The histogram above is skewed to the left, with a long tail toward negative values. This means that most of the quotes in the dataset have higher scores, that is, a positive sentiment. This is expected, since the dataset contains quotes from books on happiness. There is one outlier with a sentiment score of almost -5.
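The reading of the histogram can be checked numerically. The sketch below uses a made-up vector of scores standing in for happiness_books$sentiment (the values are hypothetical, not taken from the dataset) and a hand-written moment-based skewness() helper, which is an illustration rather than a function from any package used above.

```r
# Hypothetical scores standing in for happiness_books$sentiment
scores <- c(0.5, 0.8, 1.2, 0.75, 1.0, 0.6, 1.5, 0.9, -4.8, 0.4)

# Moment-based skewness: negative values indicate a longer left tail
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

skew <- skewness(scores)            # sign shows the direction of the skew
share_positive <- mean(scores > 0)  # fraction of quotes scoring above zero
lowest <- which.min(scores)         # position of the most negative quote

skew
share_positive
lowest
```

On the real data, happiness_books$favorite_quote[which.min(happiness_books$sentiment)] would show which quote produces the outlier.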

Conclusion

The results of the above sentiment analysis differ from the original example mainly because of the nature of the texts and the sentiment method used. The original analysis of Jane Austen's novels shows significant fluctuations in sentiment, whereas the happiness quote dataset produces more consistently positive sentiment scores with less variation. In addition, this analysis uses the Syuzhet lexicon through get_sentiment(), which returns a single score for each quote as a whole. The original example instead joins individual words to a lexicon and then counts or sums the matches within 80-line sections.