Sentiment analysis provides a way to understand what is expressed in text: the tone can be happy or positive, sad or negative, or neutral.
At some point, almost all of us have taken part in a sentiment survey, whether it is YouTube popping up a questionnaire while a video loads or, after an online purchase, an email from the retailer asking how satisfied you were and what could be improved next time.
Sentiment analysis helps us follow how the emotion and opinion expressed in a text change over its course.
In today’s project, we follow Julia Silge and David Robinson’s guide to text mining and sentiment analysis.
All the data are loaded from R packages, namely janeaustenr and gutenbergr.
Let’s get going.
library("janeaustenr")
library("stringr")
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library("tidytext")
library("tokenizers")
library("ggplot2")
library("tidyr")
library("scales")
text <- c("Because I could not stop for Death -",
"He kindly stopped me -",
"The Carriage held but just Ourselves -",
"and Immortality")
text
## [1] "Because I could not stop for Death -"
## [2] "He kindly stopped me -"
## [3] "The Carriage held but just Ourselves -"
## [4] "and Immortality"
text_df <- tibble(line = 1:4, text = text)
text_df
## # A tibble: 4 x 2
## line text
## <int> <chr>
## 1 1 Because I could not stop for Death -
## 2 2 He kindly stopped me -
## 3 3 The Carriage held but just Ourselves -
## 4 4 and Immortality
unnest_tokens() from tidytext then splits each line into one token per row:
text_df %>%
  unnest_tokens(word, text)
## # A tibble: 19 x 2
## line word
## <int> <chr>
## 1 1 because
## 2 1 i
## 3 1 could
## 4 1 not
## 5 1 stop
## 6 1 for
## 7 1 death
## 8 2 he
## 9 2 kindly
## 10 2 stopped
## 11 2 me
## 12 3 the
## 13 3 carriage
## 14 3 held
## 15 3 but
## 16 3 just
## 17 3 ourselves
## 18 4 and
## 19 4 immortality
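Notice that unnest_tokens() lowercases tokens and strips punctuation by default, which is why “Death” appears as “death”. Other token types are available via the token argument; as a quick sketch (powered by the tokenizers package loaded above), the same stanza can be split into sentences instead:
# Tokenize by sentence rather than by word; to_lower = FALSE keeps the
# original capitalization.
text_df %>%
  unnest_tokens(sentence, text, token = "sentences", to_lower = FALSE)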
Now for real text: austen_books() from janeaustenr supplies the full text of Jane Austen’s six novels. We annotate each line with its line number and chapter:
original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup()
original_books
## # A tibble: 73,422 x 4
## text book linenumber chapter
## <chr> <fct> <int> <int>
## 1 "SENSE AND SENSIBILITY" Sense & Sensibility 1 0
## 2 "" Sense & Sensibility 2 0
## 3 "by Jane Austen" Sense & Sensibility 3 0
## 4 "" Sense & Sensibility 4 0
## 5 "(1811)" Sense & Sensibility 5 0
## 6 "" Sense & Sensibility 6 0
## 7 "" Sense & Sensibility 7 0
## 8 "" Sense & Sensibility 8 0
## 9 "" Sense & Sensibility 9 0
## 10 "CHAPTER 1" Sense & Sensibility 10 1
## # ... with 73,412 more rows
unnest_tokens() restructures the novels as one word per row (we name the output column word so it joins cleanly with other tidytext data sets later):
tidy_books <- original_books %>%
  unnest_tokens(word, text)
tidy_books
## # A tibble: 725,055 x 4
## book linenumber chapter word
## <fct> <int> <int> <chr>
## 1 Sense & Sensibility 1 0 sense
## 2 Sense & Sensibility 1 0 and
## 3 Sense & Sensibility 1 0 sensibility
## 4 Sense & Sensibility 3 0 by
## 5 Sense & Sensibility 3 0 jane
## 6 Sense & Sensibility 3 0 austen
## 7 Sense & Sensibility 5 0 1811
## 8 Sense & Sensibility 10 1 chapter
## 9 Sense & Sensibility 10 1 1
## 10 Sense & Sensibility 13 1 the
## # ... with 725,045 more rows
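Many of the most frequent tokens are function words such as “the” and “and” that carry little meaning on their own. They are usually dropped with an anti-join against a stop word list; a minimal sketch using the stop_words data frame that ships with tidytext:
# Remove stop words, then list the most frequent remaining words.
tidy_books %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)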
A common text mining task is comparing word frequencies across authors. For this part of the project, we use the gutenbergr package to download two further sets of novels: several by H.G. Wells and several by the Brontë sisters.
#install.packages("gutenbergr")
library("gutenbergr")
# The Time Machine, The War of the Worlds, The Invisible Man, and
# The Island of Doctor Moreau, by their Project Gutenberg IDs
hgwells <- gutenberg_download(c(35, 36, 5230, 159))
tidy_hgwells <- hgwells %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
## Joining, by = "word"
tidy_hgwells %>%
  count(word, sort = TRUE)
## # A tibble: 11,830 x 2
## word n
## <chr> <int>
## 1 time 461
## 2 people 302
## 3 door 260
## 4 heard 249
## 5 black 232
## 6 stood 229
## 7 white 224
## 8 hand 218
## 9 kemp 213
## 10 eyes 210
## # ... with 11,820 more rows
# Jane Eyre, Wuthering Heights, The Tenant of Wildfell Hall,
# Villette, and Agnes Grey, by their Project Gutenberg IDs
bronte <- gutenberg_download(c(1260, 768, 969, 9182, 767))
tidy_bronte <- bronte %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
## Joining, by = "word"
tidy_bronte %>%
  count(word, sort = TRUE)
## # A tibble: 23,303 x 2
## word n
## <chr> <int>
## 1 time 1064
## 2 miss 854
## 3 day 826
## 4 hand 767
## 5 eyes 713
## 6 don’t 666
## 7 night 648
## 8 heart 638
## 9 looked 601
## 10 door 591
## # ... with 23,293 more rows
The counts above are arranged in descending order: “time” is the most used word in both corpora, and in the Brontë novels it is followed by “miss” and “day”, with “door” closing out the top ten.
Next, we calculate the frequency of each word for each author by binding the data frames together. Since the Brontë and Wells sets have had stop words removed, we do the same to the Austen set so the proportions are comparable, and we use the pivot_wider() and pivot_longer() functions from tidyr to reshape the result for plotting and comparison.
freq <- bind_rows(mutate(tidy_bronte, author = "Brontë Sisters"),
                  mutate(tidy_hgwells, author = "H.G. Wells"),
                  mutate(anti_join(tidy_books, stop_words, by = "word"),
                         author = "Jane Austen")) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>%
  select(-n) %>%
  pivot_wider(names_from = author, values_from = proportion) %>%
  pivot_longer(`Brontë Sisters`:`H.G. Wells`,
               names_to = "author", values_to = "proportion")
freq
## # A tibble: 51,490 x 4
## word `Jane Austen` author proportion
## <chr> <dbl> <chr> <dbl>
## 1 a NA Brontë Sisters 0.0000587
## 2 a NA H.G. Wells 0.0000148
## 3 aback NA Brontë Sisters 0.00000391
## 4 aback NA H.G. Wells 0.0000148
## 5 abaht NA Brontë Sisters 0.00000391
## 6 abaht NA H.G. Wells NA
## 7 abandon NA Brontë Sisters 0.0000313
## 8 abandon NA H.G. Wells 0.0000148
## 9 abandoned NA Brontë Sisters 0.0000900
## 10 abandoned NA H.G. Wells 0.000178
## # ... with 51,480 more rows
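The reshaped freq data frame is ready for plotting. The sketch below follows the comparison plot in Silge & Robinson, charting each author’s word proportions against Jane Austen’s on log scales; percent_format() comes from the scales package loaded earlier, and the jitter, alpha, and size settings are stylistic choices:
# Words near the dashed diagonal are used with similar frequency by
# Jane Austen and the author in that panel.
ggplot(freq, aes(x = proportion, y = `Jane Austen`,
                 color = abs(`Jane Austen` - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  facet_wrap(~ author, ncol = 2) +
  theme(legend.position = "none") +
  labs(y = "Jane Austen", x = NULL)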
With frequencies covered, we turn to sentiment. tidytext’s get_sentiments() gives access to several sentiment lexicons, and the textdata package handles downloading them.
library(textdata)
get_sentiments("afinn")
## # A tibble: 2,477 x 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ... with 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 x 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # ... with 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,901 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ... with 13,891 more rows
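The three lexicons differ in design: AFINN scores words from -5 to 5, while Bing and NRC use categorical labels, and NRC also tags emotions such as trust and fear. Because their sizes and positive-to-negative balances differ, the choice of lexicon affects the results. A quick sketch to compare that balance:
# NRC: keep only its positive/negative entries (it also tags emotions).
get_sentiments("nrc") %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  count(sentiment)

# Bing: every entry is either positive or negative.
get_sentiments("bing") %>%
  count(sentiment)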
Now for the sentiment analysis proper. We rebuild the tidy, one-word-per-row version of the Austen novels:
library(janeaustenr)
library(dplyr)
library(stringr)

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
Filtering the NRC lexicon for a single emotion, joy, lets us count the most common joy words in Emma:
nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")

tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 303 x 2
## word n
## <chr> <int>
## 1 good 359
## 2 young 192
## 3 friend 166
## 4 hope 143
## 5 happy 125
## 6 love 117
## 7 deal 92
## 8 found 92
## 9 present 89
## 10 kind 82
## # ... with 293 more rows
Next, we track how sentiment changes over the course of each novel. Integer division (%/%) groups the text into chunks of 80 lines; within each chunk we count positive and negative words, and the difference gives a net sentiment score:
library(tidyr)
jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
library(ggplot2)
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ book, ncol = 2, scales = "free_x")
To compare the three lexicons, we narrow down to a single novel, Pride & Prejudice:
prideprejudice <- tidy_books %>%
  filter(book == "Pride & Prejudice")
prideprejudice
## # A tibble: 122,204 x 4
## book linenumber chapter word
## <fct> <int> <int> <chr>
## 1 Pride & Prejudice 1 0 pride
## 2 Pride & Prejudice 1 0 and
## 3 Pride & Prejudice 1 0 prejudice
## 4 Pride & Prejudice 3 0 by
## 5 Pride & Prejudice 3 0 jane
## 6 Pride & Prejudice 3 0 austen
## 7 Pride & Prejudice 7 1 chapter
## 8 Pride & Prejudice 7 1 1
## 9 Pride & Prejudice 10 1 it
## 10 Pride & Prejudice 10 1 is
## # ... with 122,194 more rows
AFINN assigns numeric scores, so we sum them within each 80-line chunk:
afinn <- prideprejudice %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(index = linenumber %/% 80) %>%
  summarise(sentiment = sum(value)) %>%
  mutate(method = "AFINN")
## Joining, by = "word"
Bing and NRC use categorical labels, so for those we count positive and negative words per chunk and take the difference:
bing_and_nrc <- bind_rows(
  prideprejudice %>%
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  prideprejudice %>%
    inner_join(get_sentiments("nrc") %>%
                 filter(sentiment %in% c("positive", "negative"))) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
## Joining, by = "word"
Binding the three estimates together lets us plot them side by side:
bind_rows(afinn, bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ method, ncol = 1, scales = "free_y")
We can also ask which words contribute most to each sentiment:
bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
bing_word_counts
## # A tibble: 2,585 x 3
## word sentiment n
## <chr> <chr> <int>
## 1 miss negative 1855
## 2 well positive 1523
## 3 good positive 1380
## 4 great positive 981
## 5 like positive 725
## 6 better positive 639
## 7 enough positive 613
## 8 happy positive 534
## 9 love positive 495
## 10 pleasure positive 462
## # ... with 2,575 more rows
bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment", y = NULL)
In the novels, as in day-to-day life, the word ‘miss’ is used as a title for a young, unmarried woman, yet the Bing lexicon classifies it as negative (in the sense of ‘to miss something’).
We can add it as a custom stop word:
custom_stop_words <- bind_rows(tibble(word = c("miss"),
                                      lexicon = c("custom")),
                               stop_words)
custom_stop_words
## # A tibble: 1,150 x 2
## word lexicon
## <chr> <chr>
## 1 miss custom
## 2 a SMART
## 3 a's SMART
## 4 able SMART
## 5 about SMART
## 6 above SMART
## 7 according SMART
## 8 accordingly SMART
## 9 across SMART
## 10 actually SMART
## # ... with 1,140 more rows
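With the custom list in place, we can redo the Bing word counts without ‘miss’; a minimal sketch:
# Drop the custom stop words before joining the sentiment lexicon;
# "miss" no longer inflates the negative tally.
tidy_books %>%
  anti_join(custom_stop_words, by = "word") %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE)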
We can also use the custom list when drawing a word cloud of the most used words, with each word’s size indicating its frequency.
library(ggwordcloud)
wordcloud_df <- tidy_books %>%
  anti_join(custom_stop_words) %>%
  inner_join(get_sentiments("bing")) %>%
  count(sentiment, word, sort = TRUE) %>%
  top_n(200)
## Joining, by = "word"
## Joining, by = "word"
## Selecting by n
wordcloud_df %>%
  ggplot() +
  geom_text_wordcloud_area(aes(label = word, size = n), shape = "star") +
  scale_size_area(max_size = 15)
The wordcloud package offers a base-graphics alternative:
library(wordcloud)
## Loading required package: RColorBrewer
tidy_books %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
## Joining, by = "word"
To contrast positive and negative words in a single cloud, we reshape the counts into a matrix with reshape2’s acast() and pass it to comparison.cloud():
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)
## Joining, by = "word"
Because tidy_books stores the novels one word per row with chapter annotations, we can also analyze sentiment at the chapter level.
For example, we can ask which chapter of each novel is the most negative, measured as the proportion of negative words.
bingnegative <- get_sentiments("bing") %>%
  filter(sentiment == "negative")

wordcounts <- tidy_books %>%
  group_by(book, chapter) %>%
  summarise(words = n())
## `summarise()` has grouped output by 'book'. You can override using the `.groups` argument.
For each chapter, we count the negative words, divide by the chapter’s total word count, and keep each novel’s highest ratio (skipping chapter 0, the front matter):
tidy_books %>%
  semi_join(bingnegative) %>%
  group_by(book, chapter) %>%
  summarise(negativewords = n()) %>%
  left_join(wordcounts, by = c("book", "chapter")) %>%
  mutate(ratio = negativewords / words) %>%
  filter(chapter != 0) %>%
  slice_max(ratio, n = 1) %>%
  ungroup()
## Joining, by = "word"
## `summarise()` has grouped output by 'book'. You can override using the `.groups` argument.
## # A tibble: 6 x 5
## book chapter negativewords words ratio
## <fct> <int> <int> <int> <dbl>
## 1 Sense & Sensibility 43 161 3405 0.0473
## 2 Pride & Prejudice 34 111 2104 0.0528
## 3 Mansfield Park 46 173 3685 0.0469
## 4 Emma 15 151 3340 0.0452
## 5 Northanger Abbey 21 149 2982 0.0500
## 6 Persuasion 4 62 1807 0.0343