Sentiment Analysis

When a human reader reads a text he uses his intellect to categorize the text as positive or negative or some other category. Similarly, this can be done programmatically. To make a program able to categorize a text into some categories, programmers use the strategy which is known as sentiment analysis. The textbook ‘The Text Mining with R’ chapter 2 deals the sentiment analysis and also provides some sample codes to deal with different strategy. The following flow chart depicts the typical text analysis steps:

Figure 1: Flowchart showing the text analysis steps

The first step in the text analysis is the tokenization. In this step a text is separated into words such that one word per line in the dataframe. The well known library used for this purpose is tidytext.

The text is composed of words and sentiment of text is overall sentiments of words. The package tidytext provide functions to get the commonly used lexicons such as ‘afinn’, ‘bing’ and ‘nrc’ etc. Let’s see some lexicons.

library(tidytext)
afin <- get_sentiments("afinn")
afin[1:10, ]
bing <- get_sentiments("bing")
bing[1:10, ]

But lexicons nrc was not downloaded due to some error.

Sentiment analysis with innerjoin

library(janeaustenr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)
library(tidyr)
library(ggplot2)

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, 
                                regex("^chapter [\\divxlc]", 
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)


jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")

The diagram shows sentiment analysis of books by Jane Austin. It can be seen that each diagram shows positive or negative sentiment flow as the story proceeds.

Word counts and its visuals

bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>% 
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)

## Word Cloud

library(wordcloud)
## Loading required package: RColorBrewer
tidy_books %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
## Joining with `by = join_by(word)`

library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.

Task

Here, the other text from the gutenberg can be downloaded using the package gutenbergr. There are enormous number of books avaialbe on gutenberg.org but we are interested on the books by H.G. Wells and specifically the following books: (i) The Time Machine,

  1. The War of the Worlds,

  2. The Invisible Man, and

  3. The Island of Doctor Moreau

The books can be accessed using the gutenberg project id number using the the function gutenberg_download().

library(gutenbergr)

hgwells <- gutenberg_download(c(35, 36, 5230, 159))
## Determining mirror for Project Gutenberg from https://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
tidy_hgwells <- hgwells %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
## Joining with `by = join_by(word)`
tidy_hgwells %>%
  count(word, sort = TRUE)%>%
  filter(n>200)

Word cloud

tidy_hgwells %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 50))
## Joining with `by = join_by(word)`

Here the size of a word is proportional to its frequency. The bigger word in the word cloud is more frequent than the smaller word. It can be seen that the word ‘suddenly’ is more frequent than ‘presently’ in HG Wells’ work.

bing_word_counts <- tidy_hgwells %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 11433 of `x` matches multiple rows in `y`.
## ℹ Row 4456 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>% 
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)

Other packages to offer lexicons are sentimentr, syuzhet, quanteda, tm etc. We can use sentimentr to get some lexicons

library(sentimentr)
nrc_joy <- lexicon::nrc_emotions %>% 
  filter(joy==1)%>%
  select(term)
colnames(nrc_joy)<- 'word'

tidy_hgwells %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)%>%
  filter(n>100)
## Joining with `by = join_by(word)`
References

1.Arnold, Taylor B. 2016. cleanNLP: A Tidy Data Model for Natural Language Processing. https://cran.r-project.org/package=cleanNLP.

  1. Loughran, Tim, and Bill McDonald. 2011. “When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks.” The Journal of Finance 66 (1): 35–65. https://doi.org/10.1111/j.1540-6261.2010.01625.x.

  2. Silge, Julia. 2016. janeaustenr: Jane Austen’s Complete Novels. https://CRAN.R-project.org/package=janeaustenr.

  3. Silge, Julia, and David Robinson. 2016. “tidytext: Text Mining and Analysis Using Tidy Data Principles in r.” JOSS 1 (3). https://doi.org/10.21105/joss.00037.

  4. Silge, Julia, and David Robinson. “Text Mining With R: A tidy approach”, 1st Edition, Chapter 2. https://www.tidytextmining.com/