Introduction

Sentiment analysis is a type of text mining that aims to determine the opinion and subjectivity expressed in a text. Chapter 2 of Text Mining with R provides example code showing how to perform sentiment analysis with the tidytext, tidyr, and dplyr packages. The goal of this assignment is to explore these examples and extend one of them.

Libraries

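The tidytext package supports text mining tasks such as word processing and sentiment analysis, and works with ‘dplyr’, ‘ggplot2’, and other tidy tools. The textdata package supplies the AFINN and NRC lexicons, which are downloaded on first use.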

library(textdata)
library(tidytext)
library(tidyr)

The tidytext package provides access to several sentiment lexicons. Let’s look at one of them using the get_sentiments() function.

get_sentiments("afinn")
## # A tibble: 2,477 x 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # … with 2,467 more rows

The AFINN lexicon shown above assigns each word a score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment. Two other lexicons are used below: the bing lexicon, which categorizes words in a binary fashion as positive or negative, and the nrc lexicon, which assigns words to categories such as positive, negative, and joy.
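
To make the difference between a numeric lexicon and a categorical one concrete, here is a minimal sketch. The two mini-lexicons and the token list are hypothetical stand-ins built by hand (the real lexicons come from get_sentiments()), and the scores are illustrative only.

```r
library(dplyr)

# Hypothetical mini-lexicons; the real ones come from get_sentiments()
toy_afinn <- tibble(word  = c("happy", "love", "abandon", "abhor"),
                    value = c(3, 3, -2, -3))
toy_bing  <- tibble(word      = c("happy", "love", "abandon", "abhor"),
                    sentiment = c("positive", "positive", "negative", "negative"))

# A hand-made, already tokenized sentence
tokens <- tibble(word = c("i", "love", "this", "happy", "dog", "but", "abhor", "rain"))

# AFINN-style scoring: sum the numeric values of the matched words
afinn_score <- tokens %>%
  inner_join(toy_afinn, by = "word") %>%
  summarise(score = sum(value)) %>%
  pull(score)

# bing-style scoring: count the matched words per label
bing_counts <- tokens %>%
  inner_join(toy_bing, by = "word") %>%
  count(sentiment)

afinn_score
bing_counts
```

Words that appear in neither lexicon (for example "dog") simply drop out of the join, so they do not contribute to either score.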

Tidy format data

The first step before doing a sentiment analysis is to prepare the data in tidy format. The text comes from Jane Austen’s novels, loaded with the janeaustenr package and then converted to the tidy one-word-per-row format using unnest_tokens().

library(janeaustenr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", 
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
head(tidy_books)
## # A tibble: 6 x 4
##   book                linenumber chapter word       
##   <fct>                    <int>   <int> <chr>      
## 1 Sense & Sensibility          1       0 sense      
## 2 Sense & Sensibility          1       0 and        
## 3 Sense & Sensibility          1       0 sensibility
## 4 Sense & Sensibility          3       0 by         
## 5 Sense & Sensibility          3       0 jane       
## 6 Sense & Sensibility          3       0 austen
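
To see what unnest_tokens() does in isolation, the following sketch applies it to a two-line toy input. The tibble toy_text and its column names are made up for illustration.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-line input, one row per line of text
toy_text <- tibble(line = 1:2,
                   text = c("Sense and Sensibility", "by Jane Austen"))

# Split the text column into one lowercase word per row;
# the other columns (here, line) are carried along
toy_tokens <- toy_text %>%
  unnest_tokens(word, text)

toy_tokens
```

Each word ends up on its own row, lowercased, with the line number it came from, which is the same shape as tidy_books above.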

Sentiment analysis

Now that the text is in a tidy format, the sentiment analysis can be performed.

# Count the most common joy words in Emma
nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 303 x 2
##    word        n
##    <chr>   <int>
##  1 good      359
##  2 young     192
##  3 friend    166
##  4 hope      143
##  5 happy     125
##  6 love      117
##  7 deal       92
##  8 found      92
##  9 present    89
## 10 kind       82
## # … with 293 more rows

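Next, examine how sentiment changes throughout each novel by counting the positive and negative words in each 80-line section and taking the difference (positive minus negative) as the net sentiment.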

jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
head(jane_austen_sentiment)
## # A tibble: 6 x 5
##   book                index negative positive sentiment
##   <fct>               <dbl>    <dbl>    <dbl>     <dbl>
## 1 Sense & Sensibility     0       16       32        16
## 2 Sense & Sensibility     1       19       53        34
## 3 Sense & Sensibility     2       12       31        19
## 4 Sense & Sensibility     3       15       31        16
## 5 Sense & Sensibility     4       16       34        18
## 6 Sense & Sensibility     5       16       51        35
library(ggplot2)

ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")
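
The sectioning step relies on integer division: linenumber %/% 80 maps lines 0 to 79 to section 0, lines 80 to 159 to section 1, and so on. The sketch below works through this on a hypothetical set of matched words. It uses pivot_wider() in place of spread(), assuming tidyr 1.1.0 or later; this is a substitution for illustration, not the code from the book.

```r
library(dplyr)
library(tidyr)

# Hypothetical matched words: line number plus a bing-style label
toy_matches <- tibble(
  linenumber = c(3, 40, 79, 80, 95, 159, 160),
  sentiment  = c("positive", "negative", "positive",
                 "negative", "negative", "positive", "positive")
)

toy_net <- toy_matches %>%
  # Integer division assigns each line to an 80-line section
  count(index = linenumber %/% 80, sentiment) %>%
  # One column per sentiment; sections with no match for a label get 0
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  # Net sentiment per section
  mutate(sentiment = positive - negative)

toy_net
```

Section 2 contains no negative words, so the fill value of 0 keeps the subtraction from producing NA.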

Sentiment Analysis using Bing Lexicon

In this example, we will use the novel Emma (the book the code below filters on) and split its text into words for our sentiment analysis with the bing lexicon.

Format the data

tidyData <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", 
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

Count the words

    positive_senti <- get_sentiments("bing") %>%
     filter(sentiment == "positive")
    tidyData %>%
     filter(book == "Emma") %>%
     semi_join(positive_senti) %>%
     count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 668 x 2
##    word         n
##    <chr>    <int>
##  1 well       401
##  2 good       359
##  3 great      264
##  4 like       200
##  5 better     173
##  6 enough     129
##  7 happy      125
##  8 love       117
##  9 pleasure   115
## 10 right       92
## # … with 658 more rows
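
The code above uses semi_join() where the earlier examples used inner_join(). As a rough illustration of the difference, the sketch below runs both on hypothetical toy tables: semi_join() only filters the left table, while inner_join() also brings in the lexicon’s columns.

```r
library(dplyr)

# Hypothetical tokens and a two-word toy lexicon
toy_words   <- tibble(word = c("good", "miss", "good", "house"))
toy_lexicon <- tibble(word      = c("good", "miss"),
                      sentiment = c("positive", "negative"))

# semi_join: keep rows of toy_words that have a match, add no columns
kept <- toy_words %>% semi_join(toy_lexicon, by = "word")

# inner_join: keep the matching rows and add the sentiment column
joined <- toy_words %>% inner_join(toy_lexicon, by = "word")

names(kept)
names(joined)
```

Because positive_senti was already filtered to a single sentiment and only the word counts are needed, either join gives the same counts in that case.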

Next, count the positive and negative words in each 80-line section, spread the counts into separate positive and negative columns, and compute the net sentiment.

    library(tidyr)
    bing <- get_sentiments("bing")
    Emma_sentiment <- tidyData %>%
     # Keep only Emma; count(book = "Emma") alone would relabel every novel as Emma
     filter(book == "Emma") %>%
     inner_join(bing) %>%
     count(book, index = linenumber %/% 80, sentiment) %>%
     spread(sentiment, n, fill = 0) %>%
     mutate(sentiment = positive - negative)
## Joining, by = "word"
    head(bing)
## # A tibble: 6 x 2
##   word       sentiment
##   <chr>      <chr>    
## 1 2-faces    negative 
## 2 abnormal   negative 
## 3 abolish    negative 
## 4 abominable negative 
## 5 abominably negative 
## 6 abominate  negative

Count positive and negative words

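Let us now count the most common positive and negative words. Because tidyData is not filtered here, the counts cover all of the novels. Note that “miss” is coded as negative in the bing lexicon, although in Austen’s novels it is mostly used as a title.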

    counting_words <- tidyData %>%
     inner_join(bing) %>%
     count(word, sentiment, sort = TRUE)
## Joining, by = "word"
    head(counting_words)
## # A tibble: 6 x 3
##   word   sentiment     n
##   <chr>  <chr>     <int>
## 1 miss   negative   1855
## 2 well   positive   1523
## 3 good   positive   1380
## 4 great  positive    981
## 5 like   positive    725
## 6 better positive    639

Next, we visualize these counts, plotting negative words as negative values so that the two sentiments point in opposite directions.

    counting_words %>%
     filter(n > 150) %>%
     mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
     mutate(word = reorder(word, n)) %>%
     ggplot(aes(word, n, fill = sentiment))+
     geom_col() +
     coord_flip() +
     labs(y = "Sentiment Score")

Visualization

For the final visualization, let us create a word cloud that shows the most frequent positive and negative words.

    library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
    library(wordcloud)
## Loading required package: RColorBrewer
    tidyData %>%
     inner_join(bing) %>%
     count(word, sentiment, sort = TRUE) %>%
     acast(word ~ sentiment, value.var = "n", fill = 0) %>%
     comparison.cloud(colors = c("red", "dark green"),
              max.words = 100)
## Joining, by = "word"
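
comparison.cloud() expects a matrix with one row per word and one column per group, which is what the acast() step produces. The sketch below shows that reshaping on a hypothetical four-word count table; the counts are made up.

```r
library(reshape2)

# Hypothetical word counts in long format (one row per word/sentiment pair)
toy_counts <- data.frame(
  word      = c("well", "good", "miss", "poor"),
  sentiment = c("positive", "positive", "negative", "negative"),
  n         = c(4, 3, 5, 2)
)

# Reshape to a word-by-sentiment matrix; missing combinations become 0
toy_matrix <- acast(toy_counts, word ~ sentiment, value.var = "n", fill = 0)

toy_matrix
```

Each word has a count in only one column, so fill = 0 supplies the zero for the other sentiment.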

Conclusion

In this assignment, we went through some examples of sentiment analysis from Text Mining with R. We learned about the concept of sentiment analysis and applied it to the text of Jane Austen’s books. We then extended the examples using another lexicon, ‘bing’. Finally, we plotted the sentiment scores and produced a word cloud of the most frequent positive and negative words.