The original code can be found in Chapter Two of the book Text Mining with R.
Citation:
* Title: Sentiment analysis with tidy data
* Author: Julia Silge and David Robinson
* Date: 4/6/21
* Code version: a912425
* Availability: https://github.com/dgrtwo/tidy-text-mining/blob/master/02-sentiment-analysis.Rmd
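The first thing we need to do is load the sentiment lexicons. The AFINN and NRC lexicons are downloaded through the textdata package, so install it first if you don't already have it.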
library(tidytext)
#install.packages('textdata')
library('textdata')
get_sentiments("afinn")
## # A tibble: 2,477 x 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ... with 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 x 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # ... with 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,901 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ... with 13,891 more rows
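The lexicons come in two shapes: AFINN gives each word a numeric value, while Bing and NRC give each word a label. Here is a minimal sketch of how that changes the scoring, using tiny hand-made lexicons rather than the real ones. The `tokens`, `mini_afinn`, and `mini_bing` tables are made up for illustration; only the two negative AFINN values are copied from the output above.

```r
library(dplyr)

# Toy word list (hypothetical, for illustration only)
tokens <- tibble(word = c("abandon", "abhor", "happy", "table"))

# Miniature lexicons that mimic the two shapes shown above.
# The "happy" rows are assumptions, not taken from the real lexicons.
mini_afinn <- tibble(word  = c("abandon", "abhor", "happy"),
                     value = c(-2, -3, 3))
mini_bing  <- tibble(word      = c("abandon", "abhor", "happy"),
                     sentiment = c("negative", "negative", "positive"))

# Numeric lexicon: the score is the sum of the matched values
afinn_score <- tokens %>%
  inner_join(mini_afinn, by = "word") %>%
  summarise(score = sum(value)) %>%
  pull(score)

# Label lexicon: the score comes from counting matches per label
bing_counts <- tokens %>%
  inner_join(mini_bing, by = "word") %>%
  count(sentiment)

afinn_score
bing_counts
```

Words that are missing from a lexicon (like "table" here) simply drop out of the inner join, so they contribute nothing to either score.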
The next step is to pick an author. I like F. Scott Fitzgerald, and since he is known as a cynical writer, I want to see whether the sentiment analysis picks up on that. I will download his books from Project Gutenberg (the ones that are out of copyright, at least).
library(stringr)
#install.packages('gutenbergr')
library('gutenbergr')
# List the works by F. Scott Fitzgerald in the gutenbergr metadata
gutenberg_works(str_detect(author, "Fitzgerald, F. Scott"))
## # A tibble: 4 x 8
## gutenberg_id title author gutenberg_autho~ language gutenberg_books~ rights
## <int> <chr> <chr> <int> <chr> <chr> <chr>
## 1 805 This ~ Fitzger~ 420 en <NA> Publi~
## 2 4368 Flapp~ Fitzger~ 420 en <NA> Publi~
## 3 6695 Tales~ Fitzger~ 420 en <NA> Publi~
## 4 9830 The B~ Fitzger~ 420 en <NA> Publi~
## # ... with 1 more variable: has_text <lgl>
paradise <- gutenberg_download(805)
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
paradise$book <- 'This Side of Paradise'
flappers <- gutenberg_download(4368)
flappers$book <- 'Flappers and Philosophers'
tales <- gutenberg_download(6695)
tales$book <- 'Tales of the Jazz Age'
beautiful <- gutenberg_download(9830)
beautiful$book <- 'The Beautiful and Damned'
# The Great Gatsby wasn't listed in the metadata above, so I looked up its Project Gutenberg ID online
gatsby <- gutenberg_download(64317)
gatsby$book <- 'The Great Gatsby'
books <- rbind(
  paradise,
  flappers,
  tales,
  beautiful,
  gatsby
)
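As an aside, the label-then-bind pattern above can be written in one step with `dplyr::bind_rows()`, which takes a named list and stores the names in a column given by `.id`. This is only a sketch on toy data frames (`first_book` and `second_book` are hypothetical stand-ins, not real downloads), so it runs without a network connection.

```r
library(dplyr)

# Toy stand-ins for the downloaded texts (made-up content, not real book text)
first_book  <- tibble(gutenberg_id = 1L, text = c("line one", "line two"))
second_book <- tibble(gutenberg_id = 2L, text = "only line")

# The list names become the values of the new `book` column
toy_books <- bind_rows(
  list("First Book" = first_book, "Second Book" = second_book),
  .id = "book"
)

toy_books
```

The result has the same information as assigning `$book` on each data frame and calling `rbind()`, only with the `book` column placed first.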
Now that we have the books in one data frame, let’s record the line number and chapter number for each line, then split the text so there is one word per row.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
tidy_books <- books %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text,
                                regex("^chapter [\\divxlc]",
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
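To see what that pipeline does, here is the same chapter-counting and tokenizing logic on a few made-up lines (the `toy` text is hypothetical). The running `cumsum()` goes up by one on every line that starts with "chapter" followed by a digit or Roman numeral, so any heading written another way, such as a bare "III", is not counted.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Made-up lines standing in for a book's text column
toy <- tibble(text = c("Title page",
                       "CHAPTER I",
                       "It was hot.",
                       "Chapter 2",
                       "Money talks.",
                       "III",                      # bare numeral: not matched
                       "Still chapter two here."))

toy_tidy <- toy %>%
  mutate(
    linenumber = row_number(),
    # TRUE on heading lines; cumsum turns that into a running chapter number
    chapter = cumsum(str_detect(text,
                                regex("^chapter [\\divxlc]",
                                      ignore_case = TRUE)))) %>%
  # one lowercased word per row, punctuation stripped
  unnest_tokens(word, text)

toy_tidy
```

Lines before the first matched heading get chapter 0, which is worth keeping in mind when a book's front matter is long.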
Let’s see how this looks by pulling out all the words in The Great Gatsby that the NRC lexicon tags as anger.
# Keep only the NRC words tagged with the "anger" sentiment
nrc_anger <- get_sentiments("nrc") %>%
  filter(sentiment == "anger")
tidy_books %>%
  filter(book == "The Great Gatsby") %>%
  inner_join(nrc_anger) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 225 x 2
## word n
## <chr> <int>
## 1 hot 25
## 2 money 23
## 3 words 15
## 4 crazy 14
## 5 bad 11
## 6 broken 10
## 7 darkness 9
## 8 feeling 9
## 9 terrible 9
## 10 strained 8
## # ... with 215 more rows
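Now we can compute a net sentiment score (positive words minus negative words) for every 50-line section of text. I use a fairly small section size since these books are typically shorter than the books of Jane Austen.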
library(tidyr)
f_scott_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 50, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
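The section index comes from integer division: `linenumber %/% 50` sends lines 0 to 49 to section 0, lines 50 to 99 to section 1, and so on. Here is a small worked sketch of the count, pivot, and subtract steps on made-up matches (the `toy_hits` table is hypothetical).

```r
library(dplyr)
library(tidyr)

# Made-up lexicon matches: the line each word came from and its label
toy_hits <- tibble(
  linenumber = c(3, 20, 49, 50, 75, 120),
  sentiment  = c("positive", "negative", "negative",
                 "positive", "positive", "negative")
)

toy_sections <- toy_hits %>%
  # integer division groups lines into 50-line sections
  count(index = linenumber %/% 50, sentiment) %>%
  # one column per label; sections with no matches for a label get 0
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)

toy_sections
```

Section 0 has one positive and two negative words (net -1), section 1 has two positive words (net 2), and section 2 has one negative word (net -1). Without `values_fill = 0`, the missing counts would be `NA` and the subtraction would return `NA` for those sections.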
Let’s graph the results.
library(ggplot2)
ggplot(f_scott_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")
I’m curious: did F. Scott become more cynical over time?
f_scott_sentiment %>%
  group_by(book) %>%
  summarise(total_sentiment = sum(sentiment)) %>%
  arrange(desc(total_sentiment))
## # A tibble: 5 x 2
## book total_sentiment
## <chr> <int>
## 1 Flappers and Philosophers -164
## 2 The Great Gatsby -282
## 3 Tales of the Jazz Age -287
## 4 This Side of Paradise -297
## 5 The Beautiful and Damned -797
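I would have thought yes, but according to this model, F. Scott did not become more cynical over time. Two of his earlier novels, This Side of Paradise (1920) and The Beautiful and Damned (1922), have the most negative totals, while The Great Gatsby (1925) scores higher than both.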
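One caveat on the totals above: they add up every section, so a longer book can come out more negative simply because it has more sections. A rough way to check is to compare the mean sentiment per section as well. The helper below, `mean_sentiment_by_book()`, is a hypothetical function I am adding for illustration (it is not from the book), shown on made-up numbers.

```r
library(dplyr)

# Hypothetical helper: total and per-section mean sentiment for each book.
# Expects one row per section with columns `book` and `sentiment`.
mean_sentiment_by_book <- function(sections) {
  sections %>%
    group_by(book) %>%
    summarise(n_sections      = n(),
              total_sentiment = sum(sentiment),
              mean_sentiment  = mean(sentiment)) %>%
    arrange(desc(mean_sentiment))
}

# Made-up data: a long, mildly negative book and a short, very negative one
toy_sections <- tibble(
  book      = c("Long", "Long", "Long", "Long", "Short"),
  sentiment = c(-2, -2, -2, -2, -5)
)

toy_summary <- mean_sentiment_by_book(toy_sections)
toy_summary
```

In the toy data the long book has the lower total (-8 versus -5) but the higher mean (-2 versus -5), so the two rankings disagree. Running `mean_sentiment_by_book(f_scott_sentiment)` would show whether the same thing happens with the real books.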
Now, let’s look at The Great Gatsby individually.
great_gastby <- tidy_books %>%
  filter(book == "The Great Gatsby")
Let’s see how the three lexicons compare on this book, using 80-line sections.
afinn <- great_gastby %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(index = linenumber %/% 80) %>%
  summarise(sentiment = sum(value)) %>%
  mutate(method = "AFINN")
## Joining, by = "word"
bing_and_nrc <- bind_rows(
  great_gastby %>%
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  great_gastby %>%
    inner_join(get_sentiments("nrc") %>%
                 filter(sentiment %in% c("positive",
                                         "negative"))
    ) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
## Joining, by = "word"
And let’s see how that looks when plotted, one panel per lexicon.
bind_rows(afinn,
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")
And let’s add a word cloud to see the most frequent words in the book, after removing stop words.
library(wordcloud)
## Loading required package: RColorBrewer
great_gastby %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
## Joining, by = "word"
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
great_gastby %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)
## Joining, by = "word"
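The `acast()` step is there because `comparison.cloud()` expects a matrix with one row per word and one column per group, not a tidy data frame. Here is a small sketch of that reshaping on made-up counts (the `toy_counts` table is hypothetical).

```r
library(reshape2)

# Made-up word counts in tidy form: one row per word/sentiment pair
toy_counts <- data.frame(
  word      = c("good", "bad", "great"),
  sentiment = c("positive", "negative", "positive"),
  n         = c(4, 3, 2)
)

# Rows become words, columns become sentiments; missing pairs are filled with 0
toy_matrix <- acast(toy_counts, word ~ sentiment, value.var = "n", fill = 0)

toy_matrix
```

Each word ends up with a count in its own sentiment column and 0 in the other, which is what lets the comparison cloud place it on the negative or positive side.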