Introduction

In this file we will work on sentiment analysis. According to the Oxford definition, sentiment analysis is the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer’s attitude towards a particular topic, product, etc. is positive, negative, or neutral. This work has two parts: i) reproduce the code from the book linked at the bottom, and ii) extend that code by applying it to another corpus and by exploring a lexicon from a package other than tidytext. So let’s get down to business.

Code from the book

In the book, the text of the novels is loaded first, and then group_by() and mutate() are used to add columns that keep track of which line and chapter of the book each word comes from; the text is then un-nested with the unnest_tokens() function to make it tidy. To reproduce this we have to set up the environment first, so let’s load all the required libraries:

Setting up the environment:

library(tidyverse)
library(syuzhet)
library(tidytext)
library(textdata)
library(janeaustenr)
library(wordcloud)
library(reshape2)

Let’s put everything into columns as mentioned above:

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    # a new chapter starts whenever a line begins with "chapter" followed by a digit or roman numeral
    chapter = cumsum(str_detect(text, 
                                regex("^chapter [\\divxlc]", 
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

Let’s filter() the data frame with the text from the books for the words from Emma and then use inner_join() to perform the sentiment analysis. What are the most common joy words in Emma? Let’s use count() from dplyr.

nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  # index = linenumber %/% 80 splits each book into sections of 80 lines
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  # net sentiment per section: positive word count minus negative word count
  mutate(sentiment = positive - negative)
## Joining, by = "word"
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")

Comparing the three sentiment dictionaries:

Let’s use all three sentiment lexicons and examine how the sentiment changes across the narrative arc of Pride and Prejudice. First, let’s use filter() to choose only the words from the one novel we are interested in.

pride_prejudice <- tidy_books %>% 
  filter(book == "Pride & Prejudice")
afinn <- pride_prejudice %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = linenumber %/% 80) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(method = "AFINN")
## Joining, by = "word"
bing_and_nrc <- bind_rows(
  pride_prejudice %>% 
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  pride_prejudice %>% 
    inner_join(get_sentiments("nrc") %>% 
                 filter(sentiment %in% c("positive", 
                                         "negative"))
    ) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining, by = "word"
## Joining, by = "word"
bind_rows(afinn, 
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")

get_sentiments("nrc") %>% 
  filter(sentiment %in% c("positive", "negative")) %>% 
  count(sentiment)
get_sentiments("bing") %>% 
  count(sentiment)

Most common positive and negative words:

One advantage of having the data frame with both sentiment and word is that we can analyze word counts that contribute to each sentiment. By implementing count() here with arguments of both word and sentiment, we find out how much each word contributed to each sentiment.

bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>% 
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)

Spotting an anomaly like “miss” (which the Bing lexicon codes as negative, even though in Austen’s novels it is mostly a title for young, unmarried women) lets us easily add it to a custom stop-word list with bind_rows():

custom_stop_words <- bind_rows(tibble(word = c("miss"),  
                                      lexicon = c("custom")), 
                               stop_words)
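
The book stops at building the list, but as a quick illustration (a minimal sketch, not in the original) the custom stop words could be applied with anti_join() before recounting the sentiment contributions:

bing_word_counts_filtered <- tidy_books %>%
  anti_join(custom_stop_words, by = "word") %>%   # drop "miss" and the standard stop words
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE)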

Wordclouds:

We’ve seen that this tidy text mining approach works well with ggplot2, but having our data in a tidy format is useful for other plots as well.

For example, consider the wordcloud package, which uses base R graphics. Let’s look at the most common words in Jane Austen’s works as a whole again, but this time as a wordcloud.

tidy_books %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
## Joining, by = "word"
## Warning in wordcloud(word, n, max.words = 100): miss could not be fit on page.
## It will not be plotted.

In other functions, such as comparison.cloud(), you may need to turn the data frame into a matrix with reshape2’s acast(). Let’s do the sentiment analysis to tag positive and negative words using an inner join, then find the most common positive and negative words. Until the step where we need to send the data to comparison.cloud(), this can all be done with joins, piping, and dplyr because our data is in tidy format.

tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)
## Joining, by = "word"

Code Source: https://www.tidytextmining.com/sentiment.html

Extending the Code

To extend the code I will apply it to a data frame that contains customer reviews of women’s clothing from an e-commerce site. First, I will load the data set and display only the first row, since the reviews are long and we do not want the data frame to take up all the space.

df <- read.csv("https://raw.githubusercontent.com/Umerfarooq122/Data_sets/main/Womens%20Clothing%20E-Commerce%20Reviews.csv")
knitr::kable(head(df, 1))
| X | Clothing.ID | Age | Title | Review.Text | Rating | Recommended.IND | Positive.Feedback.Count | Division.Name | Department.Name | Class.Name |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 767 | 33 | | Absolutely wonderful - silky and sexy and comfortable | 4 | 1 | 0 | Initmates | Intimate | Intimates |
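
Before subsetting, it can be worth a quick sanity check of the data: how many reviews there are and whether any Review.Text fields are empty (a small check added here, not part of the original write-up):

nrow(df)
sum(is.na(df$Review.Text) | df$Review.Text == "")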

After loading the data set, I will subset the review column into a tibble using the tibble() function, lower-case the text for uniformity and easier matching using str_to_lower(), and store the result in a variable called review.df.

review.df <- tibble(review_txt= str_to_lower(df$Review.Text))

Now our review column is ready for sentiment analysis.

Bing

Now we will check the customers’ sentiments using the Bing lexicon from tidytext and see what the results look like. To do that, we first un-nest the words in review.df using the unnest_tokens() function, saving the words in a word column. We then join the sentiments from the Bing lexicon using inner_join(), and get the frequency of each word’s occurrence using count().

bing_words_counts <- review.df|>
  unnest_tokens(output = word, input = review_txt)|>
  inner_join(get_sentiments("bing"))|>
  count(word, sentiment, sort = TRUE)
knitr::kable(head(bing_words_counts))
word sentiment n
love positive 8948
top positive 7405
like positive 7149
great positive 6114
perfect positive 3772
flattering positive 3517
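
Before plotting, we can check how many distinct words matched the lexicon and how the total occurrences split between positive and negative (a quick sketch added here, not in the original):

nrow(bing_words_counts)                        # number of distinct matched words
bing_words_counts |> count(sentiment, wt = n)  # total positive vs. negative occurrences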

We can see that there are close to 1,800 matched words, and displaying all of them on a graph would be really messy. Let’s get the top 10 highest-frequency words for each sentiment.

bing_t10_senti <- bing_words_counts|>
  group_by(sentiment)|>
  slice_max(order_by = n, n=10)|>
  ungroup()|>
  mutate(word=reorder(word,n))

Now that we have the top 10 words for each sentiment, let’s plot them on a bar graph.

bing_t10_senti|>
  ggplot(aes(word,n,fill=sentiment))+
  geom_bar(stat = 'identity')+labs(x='Sentiments', y='Contribution')+
  facet_wrap(~sentiment, scales = 'free_y')+coord_flip()+theme_bw()

We can see that there is a lot of positive sentiment from the customers towards the business in question. Similarly, we can display the top 100 words using a wordcloud.

bing_words_counts|>
  with(wordcloud(word, n, max.words = 100))

We can also display the words in terms of positivity and negativity using the acast() function from the reshape2 library together with comparison.cloud().

library(reshape2)

bing_words_counts|>
  acast(word ~ sentiment, value.var = "n", fill = 0) |>
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)

Loughran

Similarly, we can use another lexicon, called loughran, from the textdata package. This lexicon labels words with six possible sentiments important in financial contexts: “negative”, “positive”, “litigious”, “uncertainty”, “constraining”, or “superfluous”.

loughran_words_counts <- review.df|>
  unnest_tokens(output = word, input = review_txt)|>
  inner_join(get_sentiments("loughran"))|>
  count(word, sentiment, sort = TRUE)
## Joining, by = "word"
loughran_t10_senti <- loughran_words_counts|>
  group_by(sentiment)|>
  slice_max(order_by = n, n=10)|>
  ungroup()|>
  mutate(word=reorder(word,n))
loughran_t10_senti|>
  ggplot(aes(word,n,fill=sentiment))+
  geom_bar(stat = 'identity')+labs(x='Sentiments', y='Contribution')+
  facet_wrap(~sentiment, scales = 'free_y')+coord_flip()+theme_bw()+theme(legend.position = 'none')

As we can see, alongside the positive reviews there were some uncertain reviews too. We can also display the top words:

loughran_words_counts|>
  with(wordcloud(word, n, max.words = 100))
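
To see which words drive the “uncertainty” category mentioned above, a quick filter on the counted words is enough (a small sketch, not in the original):

loughran_words_counts |>
  filter(sentiment == "uncertainty") |>
  slice_max(order_by = n, n = 10)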

NRC

Similarly, we can use nrc to carry out sentiment analysis. nrc is a general-purpose English sentiment/emotion lexicon. This lexicon labels words with ten possible sentiments or emotions: “negative”, “positive”, “anger”, “anticipation”, “disgust”, “fear”, “joy”, “sadness”, “surprise”, and “trust”. The annotations were done manually through Amazon’s Mechanical Turk.

syu_words_counts <- review.df|>
  unnest_tokens(output = word, input = review_txt)|>
  inner_join(get_sentiments("nrc"))|>
  count(word, sentiment, sort = TRUE)
## Joining, by = "word"
knitr::kable(head(syu_words_counts))
word sentiment n
love joy 8948
love positive 8948
top anticipation 7405
top positive 7405
top trust 7405
wear negative 6439
syu_t10_senti <- syu_words_counts|>
  group_by(sentiment)|>
  slice_max(order_by = n, n=10)|>
  ungroup()|>
  mutate(word=reorder(word,n))
knitr::kable(head(syu_t10_senti))
word sentiment n
fits anger 2855
lace anger 713
disappointed anger 584
hit anger 444
belt anger 442
bad anger 392
syu_t10_senti|>
  ggplot(aes(word,n,fill=sentiment))+
  geom_bar(stat = 'identity')+labs(x='Sentiments', y='Contribution')+
  facet_wrap(~sentiment, scales = 'free_y')+coord_flip()+theme_bw()+theme(legend.position = 'none')
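
To see how the matched words spread across the ten NRC categories, we can also sum the occurrences per category (a small sketch, not in the original):

syu_words_counts |>
  count(sentiment, wt = n, sort = TRUE)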

We can see that nrc covers a wide range of emotions, which can be really helpful in understanding the different types of emotions expressed by the reviewers/respondents. Similarly, the top words according to nrc can also be displayed:

syu_words_counts|>
  with(wordcloud(word, n, max.words = 100))

Conclusion:

Sentiment analysis is a great way of understanding what customers/audiences want. In the extension of the code from the book we applied three lexicons, i.e. NRC, loughran, and Bing, accessed through different packages. Each lexicon has its own domain of application; for instance, nrc is a general-purpose lexicon compared to loughran, which is more business/finance oriented. Similarly, bing is a great lexicon for polarity. In our analysis the reviews came out as very positive, but we have to bear in mind that the analysis was carried out on the literal meaning of individual words, which ignores contextual meaning. There are ways to analyze text based on complete sentences, but those were not covered here.
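
As a pointer for that sentence-level direction, the syuzhet package loaded at the top offers get_sentences() and get_sentiment(); a minimal sketch (not part of the analysis above) could look like this:

# split a single review into sentences and score each sentence with syuzhet's default method
review_sentences <- get_sentences(df$Review.Text[1])
get_sentiment(review_sentences, method = "syuzhet")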