Sentiment analysis is a type of text mining that aims to determine the opinion and subjectivity expressed in a text. In Chapter 2 of Text Mining with R, the authors provide example code showing how to perform a sentiment analysis using the tidytext, tidyr, and dplyr packages. The goal of this assignment is to explore these examples and extend one of the code samples provided.
The tidytext package is used for text mining, word processing, and sentiment analysis with dplyr, ggplot2, and other tidy tools.
library(textdata)
library(tidytext)
library(tidyr)
The tidytext package provides access to several sentiment lexicons. Let’s look at one of them using the get_sentiments() function.
get_sentiments("afinn")
## # A tibble: 2,477 x 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # … with 2,467 more rows
Two other lexicons are available: bing and nrc. The bing lexicon categorizes words in a binary fashion into positive and negative categories, while the nrc lexicon assigns words to categories such as joy. The AFINN lexicon shown above assigns each word a score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.
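To make the difference between a numeric and a binary lexicon concrete, here is a minimal sketch that scores the same words both ways. The two mini-lexicons are hypothetical and built by hand for illustration; the values for "abandon" and "abhor" follow the AFINN output above, while the value for "happy" is an assumption.

```r
library(dplyr)

# Hypothetical mini-lexicons, invented for illustration only
toy_afinn <- tibble(
  word  = c("happy", "abandon", "abhor"),
  value = c(3, -2, -3)   # "happy" = 3 is an assumed score
)
toy_bing <- tibble(
  word      = c("happy", "abandon", "abhor"),
  sentiment = c("positive", "negative", "negative")
)

# One word per row, as in the tidy text format
words <- tibble(word = c("they", "abhor", "and", "abandon", "happy", "people"))

# Numeric lexicon: sum the scores of the matched words
afinn_score <- words %>%
  inner_join(toy_afinn, by = "word") %>%
  summarise(score = sum(value)) %>%
  pull(score)

# Binary lexicon: number of positive words minus number of negative words
bing_hits <- words %>%
  inner_join(toy_bing, by = "word")
bing_score <- sum(bing_hits$sentiment == "positive") -
  sum(bing_hits$sentiment == "negative")
```

With these toy values the numeric score weighs "abhor" more heavily than "abandon", whereas the binary score treats every negative word the same.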
The first step before doing a sentiment analysis is to prepare the data in tidy format. The text was taken from Jane Austen’s novels, loaded using the janeaustenr package, and then converted to the tidy format (one word per row) using unnest_tokens().
library(janeaustenr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
head(tidy_books)
## # A tibble: 6 x 4
## book linenumber chapter word
## <fct> <int> <int> <chr>
## 1 Sense & Sensibility 1 0 sense
## 2 Sense & Sensibility 1 0 and
## 3 Sense & Sensibility 1 0 sensibility
## 4 Sense & Sensibility 3 0 by
## 5 Sense & Sensibility 3 0 jane
## 6 Sense & Sensibility 3 0 austen
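The chapter counter and the tokenization step can be easier to follow on a few lines of text. The sketch below applies the same regex, cumsum(), and unnest_tokens() calls to a made-up five-line "novel"; the toy text and the names toy and toy_tidy are assumptions for illustration.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Made-up lines standing in for the text column of austen_books()
toy <- tibble(text = c("A Toy Novel",
                       "CHAPTER 1",
                       "She was Happy.",
                       "Chapter II",
                       "He was not."))

toy_tidy <- toy %>%
  mutate(linenumber = row_number(),
         # str_detect() flags chapter headings; cumsum() turns the flags
         # into a running chapter number (0 before the first heading)
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  # One row per word; by default words are lowercased and punctuation dropped
  unnest_tokens(word, text)
```

Each word keeps the linenumber and chapter of the line it came from, which is what later makes it possible to group sentiment by position in the book.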
Now that the text is in a tidy format, the sentiment analysis can be performed.
# Count the most common joy words in Emma
nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")
tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 303 x 2
## word n
## <chr> <int>
## 1 good 359
## 2 young 192
## 3 friend 166
## 4 hope 143
## 5 happy 125
## 6 love 117
## 7 deal 92
## 8 found 92
## 9 present 89
## 10 kind 82
## # … with 293 more rows
Next, we examine how sentiment changes throughout each novel by counting the positive and negative words in each 80-line section and taking the difference between the two counts.
jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
head(jane_austen_sentiment)
## # A tibble: 6 x 5
## book index negative positive sentiment
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Sense & Sensibility 0 16 32 16
## 2 Sense & Sensibility 1 19 53 34
## 3 Sense & Sensibility 2 12 31 19
## 4 Sense & Sensibility 3 15 31 16
## 5 Sense & Sensibility 4 16 34 18
## 6 Sense & Sensibility 5 16 51 35
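The index column comes from integer division: linenumber %/% 80 assigns lines 0 to 79 to section 0, lines 80 to 159 to section 1, and so on. Here is a small sketch of that binning and of the net score on made-up data. It uses pivot_wider(), the newer tidyr replacement for spread(); the toy_hits table is an assumption for illustration.

```r
library(dplyr)
library(tidyr)

# Made-up sentiment hits: one row per matched word
toy_hits <- tibble(
  linenumber = c(3, 40, 79, 80, 81, 159, 160),
  sentiment  = c("positive", "negative", "positive",
                 "negative", "negative", "positive", "positive")
)

toy_net <- toy_hits %>%
  # Integer division groups every 80 consecutive lines into one section
  count(index = linenumber %/% 80, sentiment) %>%
  # One column per sentiment; sections with no hits for a sentiment get 0
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  # Net sentiment per section
  mutate(sentiment = positive - negative)
```

The fill value matters: without it, a section that has no negative words would produce NA instead of 0, and the subtraction would return NA as well.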
library(ggplot2)
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")
To extend the examples, we will use the book Emma and the bing lexicon to build our own sentiment analysis, starting with the most common positive words in the novel.
tidyData <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
positive_senti <- get_sentiments("bing") %>%
  filter(sentiment == "positive")
tidyData %>%
  filter(book == "Emma") %>%
  semi_join(positive_senti) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 668 x 2
## word n
## <chr> <int>
## 1 well 401
## 2 good 359
## 3 great 264
## 4 like 200
## 5 better 173
## 6 enough 129
## 7 happy 125
## 8 love 117
## 9 pleasure 115
## 10 right 92
## # … with 658 more rows
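Note that this example uses semi_join() where the earlier one used inner_join(). Both keep only the words found in the lexicon, but they differ in what they return. The sketch below shows the difference on a hypothetical lexicon in which one word is listed under two sentiments; all names and values are invented for illustration.

```r
library(dplyr)

words <- tibble(word = c("good", "walk", "happy"))

# Hypothetical lexicon where "good" appears under two sentiments
toy_lexicon <- tibble(
  word      = c("good", "good", "happy"),
  sentiment = c("joy", "positive", "joy")
)

# inner_join() adds the lexicon columns and repeats a word once per match
joined <- inner_join(words, toy_lexicon, by = "word")

# semi_join() only filters: same columns as words, each row at most once
filtered <- semi_join(words, toy_lexicon, by = "word")
```

Here joined has three rows ("good" twice, "happy" once) and a sentiment column, while filtered has two rows and only the word column. When every lexicon word is listed once, as in positive_senti above, both joins lead to the same word counts.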
library(tidyr)
bing <- get_sentiments("bing")
Emma_sentiment <- tidyData %>%
  filter(book == "Emma") %>%  # keep only Emma; count(book = "Emma") would just relabel all novels
  inner_join(bing) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
head(bing)
## # A tibble: 6 x 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
Let us now count the most common positive and negative words. Note that tidyData is not filtered here, so the counts cover all six novels.
counting_words <- tidyData %>%
  inner_join(bing) %>%
  count(word, sentiment, sort = TRUE)
## Joining, by = "word"
head(counting_words)
## # A tibble: 6 x 3
## word sentiment n
## <chr> <chr> <int>
## 1 miss negative 1855
## 2 well positive 1523
## 3 good positive 1380
## 4 great positive 981
## 5 like positive 725
## 6 better positive 639
Next, we visualize these word counts, plotting the counts of negative words as negative values so that the two sentiments point in opposite directions.
counting_words %>%
  filter(n > 150) %>%
  mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col() +
  coord_flip() +
  labs(y = "Sentiment Score")
For the final visualization, let us create a word cloud that contrasts the most frequent positive and negative words.
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(wordcloud)
## Loading required package: RColorBrewer
tidyData %>%
  inner_join(bing) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("red", "dark green"),
                   max.words = 100)
## Joining, by = "word"
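The acast() step is what turns the tidy counts into the shape comparison.cloud() expects: a matrix with one row per word and one column per sentiment. A minimal sketch on made-up counts (the toy_counts values are assumptions, not results from the novels):

```r
library(reshape2)

# Made-up word counts in tidy form: one row per word/sentiment pair
toy_counts <- data.frame(
  word      = c("miss", "well", "good"),
  sentiment = c("negative", "positive", "positive"),
  n         = c(5, 4, 3)
)

# Rows become words, columns become sentiments; missing pairs are filled with 0
toy_matrix <- acast(toy_counts, word ~ sentiment, value.var = "n", fill = 0)
```

A word that only appears under one sentiment, such as "miss" here, gets a 0 in the other column, so it is drawn only on its own side of the cloud.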
In this assignment, we went through some examples of sentiment analysis from Text Mining with R. We learned about the concept of sentiment analysis and applied it to Jane Austen’s books. For the extension, we used another lexicon, bing. Furthermore, we represented the sentiment scores in a plot and visualized the most frequent positive and negative words in a word cloud.