Developing Sentiment Analysis in R

First, we load the tidytext package, which provides several sentiment lexicons, and print its built-in sentiments dataset.

library(tidytext)
## Warning: package 'tidytext' was built under R version 4.2.1
sentiments
## # A tibble: 6,786 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # … with 6,776 more rows

Next, we look at the following three lexicons. All of them are based on unigrams, that is, single words: a unigram is an n-gram with n = 1 drawn from a corpus of text.

  1. AFINN - assigns each word a score from -5 to 5. Negative scores indicate negative sentiment and positive scores indicate positive sentiment; the further a score is from zero, the stronger the sentiment.

  2. bing - classifies each word as either positive or negative.

  3. loughran - a lexicon developed for financial text such as shareholder reports.
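
The practical difference between a numeric lexicon and a categorical one can be sketched with two tiny hand-written tables. The names toy_numeric, toy_binary, and tokens are hypothetical, and the words and scores are written by hand for illustration rather than loaded from the real AFINN or bing lexicons.

```r
library(dplyr)

# Hypothetical mini-lexicons, written by hand for illustration only.
toy_numeric <- tibble(word  = c("good", "awful", "happy"),
                      value = c(3, -3, 3))
toy_binary  <- tibble(word      = c("good", "awful", "happy"),
                      sentiment = c("positive", "negative", "positive"))

# A short tokenised sentence, one word per row.
tokens <- tibble(word = c("a", "good", "day", "awful", "weather", "happy"))

# Numeric lexicon: the matched scores can be summed into a single number.
numeric_score <- tokens %>%
  inner_join(toy_numeric, by = "word") %>%
  summarise(score = sum(value)) %>%
  pull(score)

# Categorical lexicon: the matched words can only be counted per label.
label_counts <- tokens %>%
  inner_join(toy_binary, by = "word") %>%
  count(sentiment)
```

With a numeric lexicon the strength of each word contributes to the total, whereas a categorical lexicon treats every matched word as one vote for its label.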

In this project, we extract sentiment from our data using the bing lexicon. The get_sentiments() function retrieves a lexicon by name.

get_sentiments("bing")
## # A tibble: 6,786 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # … with 6,776 more rows

In this step, we load the dplyr, janeaustenr, and stringr packages (tidytext is already loaded). The janeaustenr package supplies the text of Jane Austen’s novels, and tidytext lets us analyse that text efficiently. Using the unnest_tokens() function, we convert the text of the books into a tidy format with one word per row.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(janeaustenr)
## Warning: package 'janeaustenr' was built under R version 4.2.1
library(stringr)
## Warning: package 'stringr' was built under R version 4.2.1
tidy_data <- austen_books() %>%
 group_by(book) %>%
 mutate(linenumber = row_number(),
        chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                ignore_case = TRUE)))) %>%
 ungroup() %>%
 unnest_tokens(word, text)
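
To see what the chapter counter and the tokenisation step do, the same pipeline can be run on a very small input. This is a minimal sketch: toy_book and toy_tidy are hypothetical names, and the four lines of text stand in for the output of austen_books().

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical four-line "book" standing in for austen_books().
toy_book <- tibble(text = c("CHAPTER I",
                            "Emma Woodhouse, handsome, clever, and rich",
                            "CHAPTER II",
                            "Mr. Weston was a native of Highbury"))

toy_tidy <- toy_book %>%
  mutate(linenumber = row_number(),
         # Each heading match adds 1, so cumsum() gives a running chapter number.
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  # One row per lowercased word; the original text column is dropped.
  unnest_tokens(word, text)

toy_tidy
```

Every word keeps the linenumber and chapter of the line it came from, which is what makes the later grouping by section possible.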

The text is now in a tidy format, with one word per row. Next, we use filter() to keep only the words that the “bing” lexicon labels as positive. To put the sentiment analysis into practice, we then count the positive words in the novel Emma.

positive_sentiment <- get_sentiments("bing") %>% filter(sentiment == "positive")

tidy_data %>%
 filter(book == "Emma") %>%
 semi_join(positive_sentiment) %>%
 count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 668 × 2
##    word         n
##    <chr>    <int>
##  1 well       401
##  2 good       359
##  3 great      264
##  4 like       200
##  5 better     173
##  6 enough     129
##  7 happy      125
##  8 love       117
##  9 pleasure   115
## 10 right       92
## # … with 658 more rows
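
The step above uses semi_join() rather than inner_join(). A minimal sketch of the difference, using a hypothetical two-word lexicon (toy_positive) in place of positive_sentiment:

```r
library(dplyr)

words <- tibble(word = c("happy", "table", "love", "happy"))

# Hypothetical two-word lexicon standing in for positive_sentiment.
toy_positive <- tibble(word = c("happy", "love"),
                       sentiment = "positive")

# semi_join() filters the rows of `words` and keeps only its own columns.
kept <- words %>% semi_join(toy_positive, by = "word")

# inner_join() filters the same rows but also appends the lexicon's columns.
joined <- words %>% inner_join(toy_positive, by = "word")

kept %>% count(word, sort = TRUE)
```

Because every word in the filtered lexicon is already positive, the extra sentiment column would add no information here, so semi_join() is sufficient.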

The results above contain many positive words, such as “good,” “happy,” and “love.” In the next step, we count the positive and negative words in each 80-line section of the text, and the spread() function separates those counts into distinct columns for positive and negative sentiment. The mutate() function then computes the net sentiment, which is the difference between the positive and negative counts.

library(tidyr)
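Emma_sentiment <- tidy_data %>%
 # Keep only the rows from Emma; count(book = "Emma", ...) would merely relabel all books
 filter(book == "Emma") %>%
 inner_join(get_sentiments('bing')) %>%
 count(book, index = linenumber %/% 80, sentiment) %>%
 spread(sentiment, n, fill = 0) %>%
 mutate(sentiment = positive - negative)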
## Joining, by = "word"
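
The binning and reshaping logic can be checked on a small hand-made example. This sketch assumes hypothetical line numbers and labels (toy_matches), and it uses pivot_wider(), the current tidyr replacement for spread(); it is an equivalent formulation, not the code used in this project.

```r
library(dplyr)
library(tidyr)

# Hypothetical matched words with the line numbers they occurred on.
toy_matches <- tibble(
  linenumber = c(3, 40, 79, 80, 120, 159, 160),
  sentiment  = c("positive", "negative", "positive",
                 "negative", "negative", "positive", "positive")
)

toy_net <- toy_matches %>%
  # Integer division: lines below 80 -> 0, 80-159 -> 1, 160-239 -> 2.
  count(index = linenumber %/% 80, sentiment) %>%
  # One column per sentiment label; missing combinations become 0.
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)

toy_net
```

Each row of toy_net is one 80-line section, and the sentiment column is positive when that section contains more positive than negative words.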

In the next step, we plot the net sentiment of each 80-line section of “Emma” as a bar chart.

library(ggplot2)
ggplot(Emma_sentiment, aes(index, sentiment, fill = book)) +
 geom_bar(stat = "identity", show.legend = TRUE) +
 facet_wrap(~book, ncol = 2, scales = 'free_x')

Next, we count the most frequent positive and negative words across all of the novels.

counting_words <- tidy_data %>%
 inner_join(get_sentiments('bing')) %>%
 count(word, sentiment, sort = TRUE)
## Joining, by = "word"
head(counting_words)
## # A tibble: 6 × 3
##   word   sentiment     n
##   <chr>  <chr>     <int>
## 1 miss   negative   1855
## 2 well   positive   1523
## 3 good   positive   1380
## 4 great  positive    981
## 5 like   positive    725
## 6 better positive    639
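
One thing to note in this output is that “miss” is labelled negative, although in Austen’s novels it very often appears as a title for young women rather than as a verb. A minimal sketch of excluding such a word with anti_join() is shown here; top_words reproduces the six rows printed above, and custom_exclusions is a hypothetical name for a hand-written exclusion list.

```r
library(dplyr)

# First rows of counting_words, copied from the printed output.
top_words <- tibble(
  word      = c("miss", "well", "good", "great", "like", "better"),
  sentiment = c("negative", "positive", "positive",
                "positive", "positive", "positive"),
  n         = c(1855L, 1523L, 1380L, 981L, 725L, 639L)
)

# Hypothetical exclusion list; extend it with any other misclassified words.
custom_exclusions <- tibble(word = "miss")

# anti_join() keeps only the rows whose word is NOT in the exclusion list.
filtered_words <- top_words %>%
  anti_join(custom_exclusions, by = "word")

filtered_words
```

Whether to apply such a filter is a judgment call that depends on how the word is actually used in the corpus.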

In the next step, we visualise these counts. The counts of negative words are made negative so that positive and negative words extend in opposite directions along a shared axis, and the ggplot() function draws the result.

counting_words %>%
 filter(n > 150) %>%
 mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
 mutate(word = reorder(word, n)) %>%
 ggplot(aes(word, n, fill = sentiment))+
 geom_col() +
 coord_flip() +
 labs(y = "Sentiment Score")
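
The sign flip and the reorder() call in the pipeline above can be inspected on a small table. This is a sketch with hypothetical counts (toy_counts); only the two transformation steps are reproduced, not the plot.

```r
library(dplyr)

# Hypothetical word counts, for illustration only.
toy_counts <- tibble(
  word      = c("miss", "well", "poor", "good"),
  sentiment = c("negative", "positive", "negative", "positive"),
  n         = c(1855, 1523, 200, 1380)
)

signed <- toy_counts %>%
  # Negative words get negative counts so their bars point the other way.
  mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
  # reorder() turns word into a factor whose levels are sorted by n.
  mutate(word = reorder(word, n))

levels(signed$word)
```

Because ggplot2 draws a factor in level order, the bars run from the most frequent negative word to the most frequent positive word.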

Finally, we build a word cloud of the most frequent positive and negative words. The following example shows how to plot both negative and positive terms in a single word cloud using the comparison.cloud() function:

library(reshape2)
## Warning: package 'reshape2' was built under R version 4.2.1
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 4.2.1
## Loading required package: RColorBrewer
tidy_data %>%
 inner_join(get_sentiments('bing')) %>%
 count(word, sentiment, sort = TRUE) %>%
 acast(word ~ sentiment, value.var = "n", fill = 0) %>%
 comparison.cloud(colors = c("red", "dark green"),
          max.words = 100)
## Joining, by = "word"
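
Before the word cloud is drawn, acast() reshapes the long table of counts into a matrix with one row per word and one column per sentiment. A minimal sketch with hypothetical counts (toy_counts) shows the shape of that matrix:

```r
library(reshape2)

# Hypothetical word counts, for illustration only.
toy_counts <- data.frame(
  word      = c("miss", "well", "good", "poor"),
  sentiment = c("negative", "positive", "positive", "negative"),
  n         = c(1855, 1523, 1380, 100)
)

# Rows are words, columns are sentiments; absent combinations are filled with 0.
word_matrix <- acast(toy_counts, word ~ sentiment, value.var = "n", fill = 0)

word_matrix
```

The columns come out as negative and then positive, which is presumably why the colours in the call above are given in the order red, dark green.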

This word cloud shows the negative and positive words as two separate groups, which makes it easy to compare the most frequent terms behind each sentiment.

Reference:

Top data science project - sentiment analysis project in R. (2019, July 13). DataFlair. Retrieved July 14, 2022, from https://data-flair.training/blogs/data-science-r-sentiment-analysis-project/