Sentiment analysis provides a way to understand what is expressed in a text: whether it is happy (positive), sad (negative), or neutral.

At some point, almost all of us have taken part in a sentiment survey: YouTube pops up a questionnaire while your video loads, or an online seller emails after a purchase to ask how satisfied you were and what could be improved to make your next experience better.

Sentiment analysis helps us understand how the emotional and opinion content of a narrative changes over the course of a text.

In today’s project, we follow Julia Silge and David Robinson’s guide to text mining and sentiment analysis, Text Mining with R: A Tidy Approach.

All the data used is loaded from packages, namely janeaustenr and gutenbergr.

Let’s get going.

Load the required packages:

library("janeaustenr")
## Warning: package 'janeaustenr' was built under R version 4.0.5
library("stringr")
## Warning: package 'stringr' was built under R version 4.0.5
library("dplyr")
## Warning: package 'dplyr' was built under R version 4.0.5
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library("tidytext")
## Warning: package 'tidytext' was built under R version 4.0.5
library("tokenizers")
## Warning: package 'tokenizers' was built under R version 4.0.5
library("ggplot2")
## Warning: package 'ggplot2' was built under R version 4.0.5
library("tidyr")
## Warning: package 'tidyr' was built under R version 4.0.5
library("scales")
## Warning: package 'scales' was built under R version 4.0.5
text <- c("Because I could not stop for Death -",
          "He kindly stopped me -",
          "The Carriage held but just Ourselves -",
          "and Immortality")
text
## [1] "Because I could not stop for Death -"  
## [2] "He kindly stopped me -"                
## [3] "The Carriage held but just Ourselves -"
## [4] "and Immortality"
text_df <- tibble(line = 1:4, text = text)
text_df
## # A tibble: 4 x 2
##    line text                                  
##   <int> <chr>                                 
## 1     1 Because I could not stop for Death -  
## 2     2 He kindly stopped me -                
## 3     3 The Carriage held but just Ourselves -
## 4     4 and Immortality
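text_df is a tibble, a modern take on the data frame. Each row still holds an entire line of the poem, so we cannot yet filter out or count individual words; for that we need one token per row, which is exactly what unnest_tokens() produces next.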
text_df %>%
  unnest_tokens(word, text)
## # A tibble: 19 x 2
##     line word       
##    <int> <chr>      
##  1     1 because    
##  2     1 i          
##  3     1 could      
##  4     1 not        
##  5     1 stop       
##  6     1 for        
##  7     1 death      
##  8     2 he         
##  9     2 kindly     
## 10     2 stopped    
## 11     2 me         
## 12     3 the        
## 13     3 carriage   
## 14     3 held       
## 15     3 but        
## 16     3 just       
## 17     3 ourselves  
## 18     4 and        
## 19     4 immortality
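By default unnest_tokens() splits the text into single words, lowercases them, and strips punctuation, while keeping the other columns (here line) so each word retains its position in the poem.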

Tidying Jane Austen’s books

original_books <- austen_books() %>%
  group_by(book) %>%
  # linenumber records each line's position within its book; chapter is a
  # running count that cumsum() increments whenever a line starts with
  # "chapter" followed by a digit or Roman numeral (case-insensitive)
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, 
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup()

original_books
## # A tibble: 73,422 x 4
##    text                    book                linenumber chapter
##    <chr>                   <fct>                    <int>   <int>
##  1 "SENSE AND SENSIBILITY" Sense & Sensibility          1       0
##  2 ""                      Sense & Sensibility          2       0
##  3 "by Jane Austen"        Sense & Sensibility          3       0
##  4 ""                      Sense & Sensibility          4       0
##  5 "(1811)"                Sense & Sensibility          5       0
##  6 ""                      Sense & Sensibility          6       0
##  7 ""                      Sense & Sensibility          7       0
##  8 ""                      Sense & Sensibility          8       0
##  9 ""                      Sense & Sensibility          9       0
## 10 "CHAPTER 1"             Sense & Sensibility         10       1
## # ... with 73,412 more rows
tidy_books <- original_books %>%
  unnest_tokens(word, text)
tidy_books
## # A tibble: 725,055 x 4
##    book                linenumber chapter word       
##    <fct>                    <int>   <int> <chr>      
##  1 Sense & Sensibility          1       0 sense      
##  2 Sense & Sensibility          1       0 and        
##  3 Sense & Sensibility          1       0 sensibility
##  4 Sense & Sensibility          3       0 by         
##  5 Sense & Sensibility          3       0 jane       
##  6 Sense & Sensibility          3       0 austen     
##  7 Sense & Sensibility          5       0 1811       
##  8 Sense & Sensibility         10       1 chapter    
##  9 Sense & Sensibility         10       1 1          
## 10 Sense & Sensibility         13       1 the        
## # ... with 725,045 more rows
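Before comparing authors, we also remove stop words (extremely common words such as "the", "of", and "to") from the Austen tokens, just as we will for Wells and the Brontës below; Silge & Robinson do the same in their first chapter:

data(stop_words)

tidy_books <- tidy_books %>%
  anti_join(stop_words)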

Word Frequencies

#install.packages("gutenbergr")
library("gutenbergr")
## Warning: package 'gutenbergr' was built under R version 4.0.5
Author 1: H.G. Wells. The Project Gutenberg IDs below correspond to The Time Machine (35), The War of the Worlds (36), The Invisible Man (5230), and The Island of Doctor Moreau (159).
hgwells <- gutenberg_download(c(35, 36, 5230, 159))
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## ! curl package not installed, falling back to using `url()`
## Using mirror http://aleph.gutenberg.org
tidy_hgwells <- hgwells %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
## Joining, by = "word"
tidy_hgwells %>%
  count(word, sort = TRUE)
## # A tibble: 11,830 x 2
##    word       n
##    <chr>  <int>
##  1 time     461
##  2 people   302
##  3 door     260
##  4 heard    249
##  5 black    232
##  6 stood    229
##  7 white    224
##  8 hand     218
##  9 kemp     213
## 10 eyes     210
## # ... with 11,820 more rows
Author 2: the Brontë sisters. The IDs correspond to Jane Eyre (1260), Wuthering Heights (768), The Tenant of Wildfell Hall (969), Villette (9182), and Agnes Grey (767).
bronte <- gutenberg_download(c(1260, 768, 969, 9182, 767))
tidy_bronte <- bronte %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
## Joining, by = "word"

So, what are the most common words used by the Brontë sisters?

tidy_bronte %>%
  count(word, sort = TRUE)
## # A tibble: 23,303 x 2
##    word       n
##    <chr>  <int>
##  1 time    1064
##  2 miss     854
##  3 day      826
##  4 hand     767
##  5 eyes     713
##  6 don’t    666
##  7 night    648
##  8 heart    638
##  9 looked   601
## 10 door     591
## # ... with 23,293 more rows
  • From the count above, the most used words appear in descending order: time leads, followed by miss and day, with door closing out the top ten.

  • Calculate the frequency of each word for each author by binding the three data frames together.

  • str_extract() keeps just the word itself, because the Project Gutenberg texts mark emphasis with underscores (e.g. _any_), which would otherwise make "_any_" and "any" count as different tokens.

  • pivot_wider() and pivot_longer() from the tidyr package then reshape the data frame so the Brontë and Wells proportions sit alongside Austen's, ready for plotting and comparison.

freq <- bind_rows(mutate(tidy_bronte, author = "Brontë Sisters"),
                       mutate(tidy_hgwells, author = "H.G. Wells"), 
                       mutate(tidy_books, author = "Jane Austen")) %>% 
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>% 
  select(-n) %>% 
  pivot_wider(names_from = author, values_from = proportion) %>%
  pivot_longer(`Brontë Sisters`:`H.G. Wells`,
               names_to = "author", values_to = "proportion")

freq
## # A tibble: 51,490 x 4
##    word      `Jane Austen` author          proportion
##    <chr>             <dbl> <chr>                <dbl>
##  1 a                    NA Brontë Sisters  0.0000587 
##  2 a                    NA H.G. Wells      0.0000148 
##  3 aback                NA Brontë Sisters  0.00000391
##  4 aback                NA H.G. Wells      0.0000148 
##  5 abaht                NA Brontë Sisters  0.00000391
##  6 abaht                NA H.G. Wells     NA         
##  7 abandon              NA Brontë Sisters  0.0000313 
##  8 abandon              NA H.G. Wells      0.0000148 
##  9 abandoned            NA Brontë Sisters  0.0000900 
## 10 abandoned            NA H.G. Wells      0.000178  
## # ... with 51,480 more rows
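The scales package loaded at the start comes into play here: Silge & Robinson finish this comparison by plotting each author's word proportions against Austen's and running correlation tests. A sketch along the lines of their chapter 1 code (expect a warning about rows with missing values being dropped from the plot):

ggplot(freq, aes(x = proportion, y = `Jane Austen`,
                 color = abs(`Jane Austen` - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001),
                       low = "darkslategray4", high = "gray75") +
  facet_wrap(~ author, ncol = 2) +
  theme(legend.position = "none") +
  labs(y = "Jane Austen", x = NULL)

# how well correlated are the word frequencies of each pair of authors?
cor.test(data = freq[freq$author == "Brontë Sisters", ],
         ~ proportion + `Jane Austen`)
cor.test(data = freq[freq$author == "H.G. Wells", ],
         ~ proportion + `Jane Austen`)

Words close to the dashed line occur with similar frequency in both sets of texts; Austen's word frequencies correlate more strongly with the Brontës' than with Wells'.

Sentiment Lexicons
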
library(textdata)
## Warning: package 'textdata' was built under R version 4.0.5
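Three general-purpose lexicons are available through get_sentiments(): AFINN (Finn Årup Nielsen) scores words from -5 to 5, Bing (Bing Liu and collaborators) labels words simply positive or negative, and NRC (Saif Mohammad and Peter Turney) tags words as positive or negative as well as with eight basic emotions such as joy, anger, and fear. The textdata package downloads AFINN and NRC the first time they are requested.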
get_sentiments("afinn")
## # A tibble: 2,477 x 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ... with 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # ... with 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ... with 13,891 more rows

Sentiment Analysis with Inner Join

library(janeaustenr)
library(dplyr)
library(stringr)
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text,
                                regex("^chapter [\\divxlc]",
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")

tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 303 x 2
##    word        n
##    <chr>   <int>
##  1 good      359
##  2 young     192
##  3 friend    166
##  4 hope      143
##  5 happy     125
##  6 love      117
##  7 deal       92
##  8 found      92
##  9 present    89
## 10 kind       82
## # ... with 293 more rows
library(tidyr)
jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
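Here index = linenumber %/% 80 uses integer division to slice each novel into 80-line chunks: small enough to trace how sentiment shifts across the narrative, large enough to contain a meaningful mix of sentiment words. values_fill = 0 puts a zero in chunks that have no words of one sentiment.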
library(ggplot2)
  • Plotting the sentiment scores to see the trajectory of each novel.
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ book, ncol = 2, scales = "free_x")

Comparing sentiment dictionaries

prideprejudice <- tidy_books %>%
  filter(book == "Pride & Prejudice")
prideprejudice
## # A tibble: 122,204 x 4
##    book              linenumber chapter word     
##    <fct>                  <int>   <int> <chr>    
##  1 Pride & Prejudice          1       0 pride    
##  2 Pride & Prejudice          1       0 and      
##  3 Pride & Prejudice          1       0 prejudice
##  4 Pride & Prejudice          3       0 by       
##  5 Pride & Prejudice          3       0 jane     
##  6 Pride & Prejudice          3       0 austen   
##  7 Pride & Prejudice          7       1 chapter  
##  8 Pride & Prejudice          7       1 1        
##  9 Pride & Prejudice         10       1 it       
## 10 Pride & Prejudice         10       1 is       
## # ... with 122,194 more rows
afinn <- prideprejudice %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = linenumber %/% 80) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(method = "AFINN")
## Joining, by = "word"
bing_and_nrc <- bind_rows(
  prideprejudice %>% 
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  prideprejudice %>% 
    inner_join(get_sentiments("nrc") %>% 
                 filter(sentiment %in% c("positive", 
                                         "negative"))
    ) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining, by = "word"
## Joining, by = "word"
  • Each method now yields an estimate of net sentiment (positive minus negative) per chunk of text, which we can bind together and compare visually.
bind_rows(afinn, bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ method, ncol = 1, scales = "free_y")
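  • The three methods trace similar trajectories through the novel but on different absolute scales; the NRC result tends to sit higher and Bing lower. Part of the explanation is the make-up of the lexicons themselves, which we can probe, as Silge & Robinson do, by counting their positive and negative entries:

get_sentiments("nrc") %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  count(sentiment)

get_sentiments("bing") %>%
  count(sentiment)

  • Both lexicons hold more negative than positive words, but the negative-to-positive ratio is higher in Bing, which helps explain its consistently lower net sentiment.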

Most common Positive & Negative words

  • With the words and their sentiments in one data frame, we can analyse which words contribute most to each sentiment.
bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
bing_word_counts
## # A tibble: 2,585 x 3
##    word     sentiment     n
##    <chr>    <chr>     <int>
##  1 miss     negative   1855
##  2 well     positive   1523
##  3 good     positive   1380
##  4 great    positive    981
##  5 like     positive    725
##  6 better   positive    639
##  7 enough   positive    613
##  8 happy    positive    534
##  9 love     positive    495
## 10 pleasure positive    462
## # ... with 2,575 more rows
  • Now let’s visualise these counts:
bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ sentiment, scales = "free_y") +
  labs(x = "Positive vs Negative Sentiments", y = NULL)

  • In the book, and even in day-to-day life, the word ‘miss’ is used as a title for a young, unmarried woman, yet the Bing lexicon tags it as a negative word (as in ‘to miss something’), inflating the negative counts.

  • We can handle this by adding ‘miss’ to a custom stop-word list:

custom_stop_words <- bind_rows(tibble(word = c("miss"),
                                      lexicon = c("custom")),
                               stop_words)

custom_stop_words
## # A tibble: 1,150 x 2
##    word        lexicon
##    <chr>       <chr>  
##  1 miss        custom 
##  2 a           SMART  
##  3 a's         SMART  
##  4 able        SMART  
##  5 about       SMART  
##  6 above       SMART  
##  7 according   SMART  
##  8 accordingly SMART  
##  9 across      SMART  
## 10 actually    SMART  
## # ... with 1,140 more rows

Wordcloud

A word cloud plots the most used words, with the size of each word indicating its frequency. The ggwordcloud package lets us build one with ggplot2 syntax.

library(ggwordcloud)
## Warning: package 'ggwordcloud' was built under R version 4.0.5
wordcloud_df <- tidy_books %>%
  anti_join(custom_stop_words) %>%
  inner_join(get_sentiments("bing")) %>%
  count(sentiment, word, sort = TRUE) %>%
  top_n(200)
## Joining, by = "word"
## Joining, by = "word"
## Selecting by n
wordcloud_df %>%
  ggplot() +
  geom_text_wordcloud_area(aes(label = word, size = n), shape = "star") + 
  scale_size_area(max_size = 15)

library(wordcloud)
## Warning: package 'wordcloud' was built under R version 4.0.5
## Loading required package: RColorBrewer
tidy_books %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
## Joining, by = "word"

  • comparison.cloud() from the wordcloud package expects a matrix rather than a data frame, so we first reshape the data frame with acast() from reshape2.
  • To contrast the most common positive and negative words in a single cloud, we again tag each word with its sentiment using an inner join.
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.0.5
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)
## Joining, by = "word"

  • Because tidy_books keeps one word per row along with its book and chapter, tidy text analysis can work at the chapter level across all of Jane Austen’s novels.

  • For example, we can ask which chapter of each novel is the most negative, measured as the proportion of negative words.

bingnegative <- get_sentiments("bing") %>%
  filter(sentiment == "negative")

wordcounts <- tidy_books %>%
  group_by(book, chapter) %>%
  summarise(words = n())
## `summarise()` has grouped output by 'book'. You can override using the `.groups` argument.
tidy_books %>%
  semi_join(bingnegative) %>%
  group_by(book, chapter) %>%
  summarise(negativewords = n()) %>%
  left_join(wordcounts, by = c("book", "chapter")) %>%
  mutate(ratio = negativewords / words) %>%
  filter(chapter != 0) %>%
  slice_max(ratio, n = 1) %>%
  ungroup()
## Joining, by = "word"
## `summarise()` has grouped output by 'book'. You can override using the `.groups` argument.
## # A tibble: 6 x 5
##   book                chapter negativewords words  ratio
##   <fct>                 <int>         <int> <int>  <dbl>
## 1 Sense & Sensibility      43           161  3405 0.0473
## 2 Pride & Prejudice        34           111  2104 0.0528
## 3 Mansfield Park           46           173  3685 0.0469
## 4 Emma                     15           151  3340 0.0452
## 5 Northanger Abbey         21           149  2982 0.0500
## 6 Persuasion                4            62  1807 0.0343
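
  • These are the chapters with the highest proportion of negative words in each novel, normalised for chapter length. They are plausibly the darkest moments of each book: chapter 34 of Pride & Prejudice, for instance, is Mr. Darcy’s first, rejected proposal, and chapter 43 of Sense & Sensibility is where Marianne falls gravely ill.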