First, we make use of the tidytext package, which ships its sentiment lexicons in the built-in 'sentiments' dataset.
library(tidytext)
## Warning: package 'tidytext' was built under R version 4.2.1
sentiments
## # A tibble: 6,786 × 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # … with 6,776 more rows
Next, we will make use of the following three general-purpose lexicons. All of them work with unigrams, i.e. n-grams of length one: single words drawn from a corpus of text (a quick look at the AFINN and loughran lexicons is sketched just after this list).
AFINN - assigns each word an integer score between -5 and 5, where negative scores indicate negative sentiment, positive scores indicate positive sentiment, and larger magnitudes indicate stronger sentiment.
bing - categorises each word as either positive or negative.
loughran - the Loughran-McDonald lexicon, built for financial documents such as shareholder reports.
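For reference, the other two lexicons can be pulled the same way with get_sentiments(). The AFINN and Loughran-McDonald tables are downloaded on first use through the textdata package, so this sketch assumes that package is installed and the download prompt is accepted.
afinn <- get_sentiments("afinn")        # two columns: word and value (an integer from -5 to 5)
loughran <- get_sentiments("loughran")  # two columns: word and sentiment (positive, negative, litigious, ...)
head(afinn)
head(loughran)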
In this project, we will extract the sentiment of our data using the bing lexicon. The get_sentiments() function retrieves any of these lexicons.
get_sentiments("bing")
## # A tibble: 6,786 × 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # … with 6,776 more rows
In this phase, we will load the "dplyr," "janeaustenr," and "stringr" libraries (tidytext is already loaded from the previous step). The janeaustenr package provides our textual data in the form of Jane Austen's novels, while tidytext lets us carry out text analysis on that data efficiently. Using the unnest_tokens() function, we will restructure the text of the books into a tidy format, with one token per row.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(janeaustenr)
## Warning: package 'janeaustenr' was built under R version 4.2.1
library(stringr)
## Warning: package 'stringr' was built under R version 4.2.1
tidy_data <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
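As an aside on the unigram remark above: unnest_tokens() can also emit longer n-grams by changing its token argument. The sketch below (not used in the rest of this project) tokenises the same novels into bigrams, i.e. sequences of two consecutive words.
bigram_data <- austen_books() %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)  # one bigram per row instead of one word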
Our text is now tidied up so that each row contains a single word. We will now use filter() on the "bing" lexicon to keep only the words labelled as positive and then, to put our sentiment analysis model into practice, pull the matching words from the novel Emma.
positive_sentiment <- get_sentiments("bing") %>% filter(sentiment == "positive")
tidy_data %>%
  filter(book == "Emma") %>%
  semi_join(positive_sentiment) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 668 × 2
## word n
## <chr> <int>
## 1 well 401
## 2 good 359
## 3 great 264
## 4 like 200
## 5 better 173
## 6 enough 129
## 7 happy 125
## 8 love 117
## 9 pleasure 115
## 10 right 92
## # … with 658 more rows
We see a lot of positive words in the results above, such as "good," "happy," and "love." In the following step, the spread() function will be used to divide our data into distinct columns for positive and negative sentiment. The net sentiment, that is, the difference between the positive and negative counts, will then be computed with the mutate() function.
library(tidyr)
Emma_sentiment <- tidy_data %>%
  filter(book == "Emma") %>%
  inner_join(get_sentiments('bing')) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
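As a side note, spread() has been superseded in recent versions of tidyr by pivot_wider(). A minimal sketch of the same reshaping with the newer function, assuming tidyr 1.1 or later:
Emma_sentiment_alt <- tidy_data %>%
  filter(book == "Emma") %>%
  inner_join(get_sentiments('bing')) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)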
In the next phase, the sentiment of the book "Emma" will be visualised: for each chunk of 80 lines, we plot its net score, the difference between its positive and negative word counts.
library(ggplot2)
ggplot(Emma_sentiment, aes(index, sentiment, fill = book)) +
  geom_bar(stat = "identity", show.legend = TRUE) +
  facet_wrap(~book, ncol = 2, scales = 'free_x')
Let's go on to counting the most frequently occurring positive and negative words across the novels.
counting_words <- tidy_data %>%
  inner_join(get_sentiments('bing')) %>%
  count(word, sentiment, sort = TRUE)
## Joining, by = "word"
head(counting_words)
## # A tibble: 6 × 3
## word sentiment n
## <chr> <chr> <int>
## 1 miss negative 1855
## 2 well positive 1523
## 3 good positive 1380
## 4 great positive 981
## 5 like positive 725
## 6 better positive 639
In the next phase, we will visualise the sentiment scores. We will plot the word counts along one axis, negating the counts of negative words so that positive and negative terms point in opposite directions, and draw the chart with ggplot().
counting_words %>%
  filter(n > 150) %>%
  mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col() +
  coord_flip() +
  labs(y = "Sentiment Score")
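An alternative layout, sketched here rather than taken from the original walkthrough, keeps the counts positive and facets the chart by sentiment instead of negating the negative counts (this assumes dplyr 1.0 or later for slice_max()):
counting_words %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Word count", y = NULL)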
Let's make a word cloud to represent the most frequently used positive and negative words. The following example shows how to plot negative and positive terms together in a single word cloud using the comparison.cloud() function:
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.2.1
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 4.2.1
## Loading required package: RColorBrewer
tidy_data %>%
  inner_join(get_sentiments('bing')) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("red", "dark green"),
                   max.words = 100)
## Joining, by = "word"
This word cloud lets us visualise the negative and positive groups of words efficiently, so we can now see at a glance how the data splits across the two sentiments.
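One optional tweak, not part of the original walkthrough: word placement in the cloud is randomised, so fixing the random seed right before plotting keeps the layout reproducible between runs.
set.seed(1234)  # any fixed seed works; rerun the comparison.cloud() pipeline above after this call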
Reference:
Top data science project - sentiment analysis project in R. (2019, July 13). DataFlair. Retrieved July 14, 2022, from https://data-flair.training/blogs/data-science-r-sentiment-analysis-project/