This document re-creates and analyzes the primary code from the textbook's sentiment analysis chapter. Citation (APA): Silge, J., & Robinson, D. (2017). Text mining with R: A tidy approach. O'Reilly Media.
library(tidytext)
## Warning: package 'tidytext' was built under R version 4.1.3
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
The tidytext package provides access to several sentiment lexicons. Three general-purpose lexicons based on unigrams (i.e., single words) are:
1. AFINN - assigns each word a score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.
2. nrc - categorizes words in a binary fashion ("yes"/"no") into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.
3. bing - categorizes words in a binary fashion into positive and negative categories.
The function get_sentiments() allows us to get specific sentiment lexicons with the appropriate measures for each one.
get_sentiments("afinn")
## # A tibble: 2,477 x 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ... with 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 x 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # ... with 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,875 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ... with 13,865 more rows
library(janeaustenr)
## Warning: package 'janeaustenr' was built under R version 4.1.3
austen_books()
## # A tibble: 73,422 x 2
## text book
## * <chr> <fct>
## 1 "SENSE AND SENSIBILITY" Sense & Sensibility
## 2 "" Sense & Sensibility
## 3 "by Jane Austen" Sense & Sensibility
## 4 "" Sense & Sensibility
## 5 "(1811)" Sense & Sensibility
## 6 "" Sense & Sensibility
## 7 "" Sense & Sensibility
## 8 "" Sense & Sensibility
## 9 "" Sense & Sensibility
## 10 "CHAPTER 1" Sense & Sensibility
## # ... with 73,412 more rows
library(janeaustenr)
library(dplyr)
library(stringr)
## Warning: package 'stringr' was built under R version 4.1.2
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
We chose the name ‘word’ for the output column from the unnest_tokens() function because the sentiment lexicons and stop-word datasets have columns named word; performing inner joins and anti-joins is thus easier.
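For example, because both tidy_books and the stop_words dataset from tidytext share a column named word, dplyr can detect the join key automatically; a quick illustration (not part of the textbook's sequence at this point):
tidy_books %>%
  anti_join(stop_words)   # no `by` needed; dplyr joins on the shared "word" column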
The text is now in a tidy format with one word per row, so we can perform sentiment analysis.
First, use the NRC lexicon and filter() for the joy words.
Then, filter() the data frame with the text from the books for the words from Emma and use inner_join() to perform the sentiment analysis.
Finally, use count() from dplyr to find the most common joy words in Emma.
nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")
tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 301 x 2
## word n
## <chr> <int>
## 1 good 359
## 2 friend 166
## 3 hope 143
## 4 happy 125
## 5 love 117
## 6 deal 92
## 7 found 92
## 8 present 89
## 9 kind 82
## 10 happiness 76
## # ... with 291 more rows
These are mostly positive, happy words.
First, find a sentiment score for each word using the Bing lexicon and inner_join().
Then, count how many positive and negative words there are in defined sections of each book.
Next, define an index that keeps track of which 80-line section of text we are counting negative and positive sentiment in; the %/% operator does integer division (x %/% y is equivalent to floor(x/y)), as illustrated in the short sketch after these steps.
Use pivot_wider() so that negative and positive sentiment are in separate columns.
Lastly, calculate a net sentiment (positive - negative).
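To make the 80-line sections concrete, here is a quick illustration of integer division (a small sketch with arbitrary line numbers, not part of the textbook's code):
c(1, 79, 80, 159, 160) %/% 80
## [1] 0 0 1 1 2
Lines 1-79 fall in section 0, lines 80-159 in section 1, and so on.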
library(tidyr)
jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
Now, plot these sentiment scores across the trajectory of each novel, with the index on the x-axis keeping track of narrative time in sections of text.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.1.2
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")
Now, compare all three sentiment lexicons and examine how the sentiment changes across the narrative arc of Pride and Prejudice.
pride_prejudice <- tidy_books %>%
filter(book == "Pride & Prejudice")
pride_prejudice
## # A tibble: 122,204 x 4
## book linenumber chapter word
## <fct> <int> <int> <chr>
## 1 Pride & Prejudice 1 0 pride
## 2 Pride & Prejudice 1 0 and
## 3 Pride & Prejudice 1 0 prejudice
## 4 Pride & Prejudice 3 0 by
## 5 Pride & Prejudice 3 0 jane
## 6 Pride & Prejudice 3 0 austen
## 7 Pride & Prejudice 7 1 chapter
## 8 Pride & Prejudice 7 1 1
## 9 Pride & Prejudice 10 1 it
## 10 Pride & Prejudice 10 1 is
## # ... with 122,194 more rows
Since the AFINN lexicon measures sentiment with a numeric score between -5 and 5, we need a different pattern for it than for the other two lexicons.
Use inner_join() to calculate the sentiment in different ways for each lexicon.
For AFINN, group by the same 80-line index (again using integer division with %/%) and sum the per-word scores to get a net sentiment for each section of text.
For Bing and NRC, count positive and negative words per section, use pivot_wider() to put them in separate columns, and calculate the net sentiment (positive - negative), just as before.
afinn <- pride_prejudice %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(index = linenumber %/% 80) %>%
  summarise(sentiment = sum(value)) %>%
  mutate(method = "AFINN")
## Joining, by = "word"
bing_and_nrc <- bind_rows(
  pride_prejudice %>%
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  pride_prejudice %>%
    inner_join(get_sentiments("nrc") %>%
                 filter(sentiment %in% c("positive", "negative"))) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
## Joining, by = "word"
Now, we have calculated an estimate of the net sentiment (positive - negative) in each chunk of the novel text for each sentiment lexicon.
The next step is to bind them together and compare them in a single visualization.
bind_rows(afinn,
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")
### Analysis:
The three lexicons give results that differ in absolute terms, but they follow similar relative trajectories through the novel, with dips and peaks in roughly the same places.
The AFINN lexicon shows the largest absolute values, with high positive values; its sentiment scores have the most variance.
The Bing lexicon shows lower absolute values and seems to label larger blocks of contiguous positive or negative text.
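One way to make the variance observation concrete is to compare the spread of the per-section scores across methods; a small sketch using the objects created above (not from the textbook's code):
bind_rows(afinn, bing_and_nrc) %>%
  group_by(method) %>%
  summarise(sd_sentiment = sd(sentiment))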
Compared to the other two, the NRC results are shifted higher, labeling the text more positively, but they detect similar relative changes in the text. To understand why the NRC results are biased so high, look at how many positive and negative words each lexicon contains.
get_sentiments("nrc") %>%
filter(sentiment %in% c("positive", "negative")) %>%
count(sentiment)
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 3318
## 2 positive 2308
get_sentiments("bing") %>%
count(sentiment)
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 4781
## 2 positive 2005
Both lexicons have more negative than positive words, but the ratio of negative to positive words is higher in the Bing lexicon than in the NRC lexicon.
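A quick check of those ratios, based on the counts printed above (an illustrative calculation, not from the textbook):
3318 / 2308   # NRC: about 1.4 negative words per positive word
4781 / 2005   # Bing: about 2.4 negative words per positive word
The lower negative-to-positive ratio in NRC helps explain why its results sit higher than Bing's in the comparison plot.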
One advantage of having a data frame with both sentiment and word is that we can analyze the word counts that contribute to each sentiment, using the count() function.
bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
bing_word_counts
## # A tibble: 2,585 x 3
## word sentiment n
## <chr> <chr> <int>
## 1 miss negative 1855
## 2 well positive 1523
## 3 good positive 1380
## 4 great positive 981
## 5 like positive 725
## 6 better positive 639
## 7 enough positive 613
## 8 happy positive 534
## 9 love positive 495
## 10 pleasure positive 462
## # ... with 2,575 more rows
The word counts can be piped straight into ggplot2 to visualize the words that contribute most to positive and negative sentiment.
bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)
The word “miss” is coded as negative but it is used as a title for young, unmarried women in Jane Austen’s works.
A custom stop-word list, built with the bind_rows() function, can be used to handle this anomaly.
custom_stop_words <- bind_rows(tibble(word = c("miss"),
                                      lexicon = c("custom")),
                               stop_words)
custom_stop_words
## # A tibble: 1,150 x 2
## word lexicon
## <chr> <chr>
## 1 miss custom
## 2 a SMART
## 3 a's SMART
## 4 able SMART
## 5 about SMART
## 6 above SMART
## 7 according SMART
## 8 accordingly SMART
## 9 across SMART
## 10 actually SMART
## # ... with 1,140 more rows
Now, visualize the most common words in Jane Austen's works again, this time as a word cloud.
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 4.1.3
## Loading required package: RColorBrewer
tidy_books %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
## Joining, by = "word"
###### reshape2 Package:
Now, perform the sentiment analysis to tag positive and negative words using an inner join, then find the most common positive and negative words; the counts are reshaped with acast() from the reshape2 package and passed to comparison.cloud() from the wordcloud package.
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.1.3
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)
## Joining, by = "word"
Some sentiment analysis algorithms look beyond unigrams (i.e., single words) and try to understand the sentiment of a sentence as a whole. Such algorithms try to understand that:
I am not having a good day.
is a sad sentence because of the use of negation.
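Packages such as sentimentr take this sentence-level approach and account for valence shifters like negation; a minimal sketch, assuming the sentimentr package is installed (illustrative only, not part of the textbook's code):
library(sentimentr)
# sentiment() scores whole sentences; "not" acts as a valence shifter,
# which should pull the score for this sentence negative
sentiment(get_sentences("I am not having a good day."))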
Here, the text is tokenized into sentences, and it makes sense to use a new name for the output column in such a case.
p_and_p_sentences <- tibble(text = prideprejudice) %>%
unnest_tokens(sentence, text, token = "sentences")
p_and_p_sentences
## # A tibble: 15,545 x 1
## sentence
## <chr>
## 1 "pride and prejudice"
## 2 "by jane austen"
## 3 "chapter 1"
## 4 "it is a truth universally acknowledged, that a single man in possession"
## 5 "of a good fortune, must be in want of a wife."
## 6 "however little known the feelings or views of such a man may be on his"
## 7 "first entering a neighbourhood, this truth is so well fixed in the minds"
## 8 "of the surrounding families, that he is considered the rightful property"
## 9 "of some one or other of their daughters."
## 10 "\"my dear mr."
## # ... with 15,535 more rows
Looking at one sentence:
p_and_p_sentences$sentence[2]
## [1] "by jane austen"
Drawback:
The sentence tokenizing does seem to have a bit of trouble with UTF-8 encoded text, especially with sections of dialogue; it does much better with punctuation in ASCII.
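One option, if the encoding matters for a given text, is to convert it with base R's iconv() inside a mutate() before unnesting; a minimal sketch (the target encoding here is just an example):
tibble(text = prideprejudice) %>%
  mutate(text = iconv(text, to = "latin1")) %>%   # convert encoding before tokenizing
  unnest_tokens(sentence, text, token = "sentences")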
Next, split the text of Jane Austen's novels into a data frame of chapters, using a regex pattern that matches the chapter headings.
austen_chapters <- austen_books() %>%
  group_by(book) %>%
  unnest_tokens(chapter, text, token = "regex",
                pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
  ungroup()
austen_chapters %>%
  group_by(book) %>%
  summarise(chapters = n())
## # A tibble: 6 x 2
## book chapters
## <fct> <int>
## 1 Sense & Sensibility 51
## 2 Pride & Prejudice 62
## 3 Mansfield Park 49
## 4 Emma 56
## 5 Northanger Abbey 32
## 6 Persuasion 25
Now, find the number of negative words in each chapter and divide by the total words in each chapter. For each book, which chapter has the highest proportion of negative words?
bingnegative <- get_sentiments("bing") %>%
  filter(sentiment == "negative")
wordcounts <- tidy_books %>%
  group_by(book, chapter) %>%
  summarize(words = n())
## `summarise()` has grouped output by 'book'. You can override using the `.groups` argument.
tidy_books %>%
  semi_join(bingnegative) %>%
  group_by(book, chapter) %>%
  summarize(negativewords = n()) %>%
  left_join(wordcounts, by = c("book", "chapter")) %>%
  mutate(ratio = negativewords/words) %>%
  filter(chapter != 0) %>%
  slice_max(ratio, n = 1) %>%
  ungroup()
## Joining, by = "word"
## `summarise()` has grouped output by 'book'. You can override using the `.groups` argument.
## # A tibble: 6 x 5
## book chapter negativewords words ratio
## <fct> <int> <int> <int> <dbl>
## 1 Sense & Sensibility 43 161 3405 0.0473
## 2 Pride & Prejudice 34 111 2104 0.0528
## 3 Mansfield Park 46 173 3685 0.0469
## 4 Emma 15 151 3340 0.0452
## 5 Northanger Abbey 21 149 2982 0.0500
## 6 Persuasion 4 62 1807 0.0343
These are the chapters with the most sad words in each book, normalized for number of words in the chapter.