Overview

In this assignment, I reproduced the primary example code from Chapter 2 of the textbook in this R Markdown document. In the chapter, the authors use three sentiment lexicons: AFINN, Bing, and NRC. We were then asked to work with a different corpus of our choosing and incorporate at least one additional sentiment lexicon. I chose the opening remarks of Ketanji Brown Jackson, the new Supreme Court Justice, and analyzed their sentiment with the syuzhet package and the Loughran lexicon.

Textbook example

This section goes over a brief part of the example given in the textbook.

Sentiment Lexicons

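AFINN

The AFINN lexicon assigns each word a score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.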

#Source: "Text Mining with R: A Tidy Approach" Chapter 2
knitr::kable(head(get_sentiments("afinn")))
| word | value |
|---|---|
| abandon | -2 |
| abandoned | -2 |
| abandons | -2 |
| abducted | -2 |
| abduction | -2 |
| abductions | -2 |

Bing

Bing categorizes words into positive and negative categories.

#Source: "Text Mining with R: A Tidy Approach" Chapter 2
knitr::kable(head(get_sentiments("bing")))
| word | sentiment |
|---|---|
| 2-faces | negative |
| abnormal | negative |
| abolish | negative |
| abominable | negative |
| abominably | negative |
| abominate | negative |

NRC

The NRC lexicon categorizes words in a binary (yes/no) fashion into the categories positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.

#Source: "Text Mining with R: A Tidy Approach" Chapter 2

knitr::kable(head(get_sentiments("nrc")))
| word | sentiment |
|---|---|
| abacus | trust |
| abandon | fear |
| abandon | negative |
| abandon | sadness |
| abandoned | anger |
| abandoned | fear |
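
Because NRC stores one row per word and category, a single word can appear several times. The sketch below, in base R, uses only the six rows printed above and counts how many categories each word carries. The object names nrc_head and categories_per_word are my own, not from the textbook.

```r
# First six rows of get_sentiments("nrc"), copied from the table above
nrc_head <- data.frame(
  word      = c("abacus", "abandon", "abandon", "abandon", "abandoned", "abandoned"),
  sentiment = c("trust", "fear", "negative", "sadness", "anger", "fear")
)

# Number of NRC categories attached to each word in this small sample
categories_per_word <- table(nrc_head$word)
categories_per_word
```

In this sample, "abandon" carries three categories, so one occurrence of it in a text would produce three rows after an inner join with NRC.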

In the textbook example, the code finds the sentiment of Jane Austen's books. Once the text is in tidy form (one word per row), it can be inner joined with a specific sentiment lexicon.
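
To make the join concrete, here is a minimal sketch on made-up data. The tables toy_words and toy_lexicon are hypothetical stand-ins for the tidy book text and a lexicon shaped like Bing. Only tibble, inner_join, and count are real dplyr functions.

```r
library(dplyr)

# Toy tidy text: one row per word (made-up data, not from the Austen corpus)
toy_words <- tibble(
  line = c(1, 1, 1, 2, 2),
  word = c("happy", "and", "good", "sad", "day")
)

# Toy lexicon in the same two-column shape as get_sentiments("bing")
toy_lexicon <- tibble(
  word      = c("happy", "good", "sad"),
  sentiment = c("positive", "positive", "negative")
)

# inner_join keeps only the words that appear in both tables,
# so "and" and "day" are dropped
toy_scored <- toy_words %>%
  inner_join(toy_lexicon, by = "word")

# Tally the matched words per sentiment
toy_counts <- toy_scored %>%
  count(sentiment)

toy_counts
```

Words that are missing from the lexicon silently disappear after the join, so the result depends on how well the lexicon covers the text.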

#Source: "Text Mining with R: A Tidy Approach" Chapter 2

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", ignore_case = TRUE)))
  ) %>%
  ungroup() %>%
  unnest_tokens(word, text)

knitr::kable(head(tidy_books))
| book | linenumber | chapter | word |
|---|---|---|---|
| Sense & Sensibility | 1 | 0 | sense |
| Sense & Sensibility | 1 | 0 | and |
| Sense & Sensibility | 1 | 0 | sensibility |
| Sense & Sensibility | 3 | 0 | by |
| Sense & Sensibility | 3 | 0 | jane |
| Sense & Sensibility | 3 | 0 | austen |

Next, the textbook uses the NRC lexicon and filters it down to the joy words. The code below counts the words in the book Emma that NRC classifies as joy.

#Source: "Text Mining with R: A Tidy Approach" Chapter 2

nrc_joy <- get_sentiments("nrc") %>% filter(sentiment == "joy")
tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 301 × 2
##    word          n
##    <chr>     <int>
##  1 good        359
##  2 friend      166
##  3 hope        143
##  4 happy       125
##  5 love        117
##  6 deal         92
##  7 found        92
##  8 present      89
##  9 kind         82
## 10 happiness    76
## # … with 291 more rows
knitr::kable(head(nrc_joy))
| word | sentiment |
|---|---|
| absolution | joy |
| abundance | joy |
| abundant | joy |
| accolade | joy |
| accompaniment | joy |
| accomplish | joy |
#Source: "Text Mining with R: A Tidy Approach" Chapter 2


jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")
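
The expression linenumber %/% 80 in the code above uses integer division to group lines into 80-line sections, and the net sentiment of each section is the positive count minus the negative count. The short base R sketch below illustrates both steps. All numbers in it are made up for the illustration.

```r
# Made-up line numbers, chosen to fall on either side of the 80-line boundaries
linenumber <- c(1, 79, 80, 159, 160, 245)

# Integer division assigns each line to an 80-line section, starting at 0
index <- linenumber %/% 80

# Made-up word counts for two sections; net sentiment is positive minus negative
positive <- c(12, 4)
negative <- c(5, 9)
net_sentiment <- positive - negative

data.frame(linenumber, index)
net_sentiment
```

Lines 1 and 79 fall into section 0, lines 80 and 159 into section 1, and so on. A section with more negative than positive words ends up with a negative net score.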

Comparing the three sentiment dictionaries

#Source: "Text Mining with R: A Tidy Approach" Chapter 2
pride_prejudice <- tidy_books %>% 
  filter(book == "Pride & Prejudice")

knitr::kable(head(pride_prejudice))
| book | linenumber | chapter | word |
|---|---|---|---|
| Pride & Prejudice | 1 | 0 | pride |
| Pride & Prejudice | 1 | 0 | and |
| Pride & Prejudice | 1 | 0 | prejudice |
| Pride & Prejudice | 3 | 0 | by |
| Pride & Prejudice | 3 | 0 | jane |
| Pride & Prejudice | 3 | 0 | austen |
#Source: "Text Mining with R: A Tidy Approach" Chapter 2

afinn <- pride_prejudice %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = linenumber %/% 80) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(method = "AFINN")
## Joining, by = "word"
bing_and_nrc <- bind_rows(
  pride_prejudice %>% 
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  pride_prejudice %>% 
    inner_join(get_sentiments("nrc") %>% 
                 filter(sentiment %in% c("positive", 
                                         "negative"))
    ) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining, by = "word"
## Joining, by = "word"
bind_rows(afinn, 
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")

#Source: "Text Mining with R: A Tidy Approach" Chapter 2
knitr::kable(get_sentiments("bing") %>% 
  count(sentiment))
| sentiment | n |
|---|---|
| negative | 4781 |
| positive | 2005 |

Most common positive and negative words

# Source: "Text Mining with R: A Tidy Approach" Chapter 2
bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
knitr::kable(head(bing_word_counts))
| word | sentiment | n |
|---|---|---|
| miss | negative | 1855 |
| well | positive | 1523 |
| good | positive | 1380 |
| great | positive | 981 |
| like | positive | 725 |
| better | positive | 639 |
# Source: "Text Mining with R: A Tidy Approach" Chapter 2
bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>% 
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)

Sentiment on a new Corpus

Scrape website

To obtain the speech, I scraped the transcript of Judge Jackson’s opening remarks from the PBS website. I then saved the text of the page's paragraphs to the variable speech.

speech_website <- read_html("https://www.pbs.org/newshour/politics/read-the-full-text-of-supreme-court-nominee-ketanji-brown-jacksons-opening-remarks")
speech <- speech_website %>%
  html_nodes("p") %>%
  html_text()

Syuzhet

Syuzhet is a package with its own lexicon. According to the package documentation, its get_nrc_sentiment function returns a data frame in which each row represents a sentence from the original file. The columns are: “anger”, “anticipation”, “disgust”, “fear”, “joy”, “sadness”, “surprise”, “trust”, “negative”, and “positive”.

Instead of get_sentiments, the syuzhet function is called get_sentiment. For each element of its input (here, each paragraph of the speech) it returns a numeric sentiment score, as shown in the output below.

get_sentiment(speech[3:24])
##  [1]  0.50  0.00  9.00  2.75  4.05  2.85  5.30  5.50  3.20  3.05  3.90  2.70
## [13]  9.05  2.60  3.00  5.00  6.10  5.40  3.15  1.20 -0.25  3.80

The get_nrc_sentiment function counts, for each element of the text, the words associated with eight different emotions and with the two sentiments negative and positive.

knitr::kable(get_nrc_sentiment(speech[3:24]))
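| anger | anticipation | disgust | fear | joy | sadness | surprise | trust | negative | positive |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 0 | 1 | 0 | 0 | 0 | 4 | 0 | 2 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 3 | 0 | 2 | 2 | 1 | 0 | 10 | 1 | 13 |
| 0 | 0 | 0 | 1 | 2 | 0 | 0 | 3 | 1 | 5 |
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 7 |
| 0 | 2 | 0 | 1 | 4 | 0 | 0 | 3 | 1 | 5 |
| 0 | 1 | 1 | 0 | 2 | 0 | 1 | 7 | 0 | 5 |
| 0 | 6 | 0 | 1 | 5 | 1 | 2 | 8 | 0 | 8 |
| 0 | 3 | 0 | 0 | 1 | 1 | 0 | 4 | 1 | 6 |
| 0 | 4 | 0 | 1 | 2 | 0 | 0 | 4 | 0 | 9 |
| 0 | 1 | 0 | 2 | 3 | 1 | 0 | 4 | 1 | 4 |
| 0 | 0 | 0 | 0 | 1 | 1 | 1 | 2 | 0 | 4 |
| 0 | 2 | 0 | 0 | 4 | 0 | 1 | 4 | 0 | 9 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 1 |
| 0 | 3 | 0 | 0 | 1 | 0 | 0 | 5 | 1 | 7 |
| 1 | 3 | 0 | 1 | 2 | 0 | 2 | 3 | 0 | 5 |
| 2 | 4 | 1 | 3 | 3 | 0 | 3 | 6 | 2 | 12 |
| 1 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 0 | 3 |
| 0 | 1 | 0 | 1 | 0 | 0 | 1 | 2 | 0 | 3 |
| 1 | 2 | 0 | 2 | 0 | 1 | 0 | 5 | 2 | 3 |
| 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 2 | 2 |
| 2 | 5 | 0 | 2 | 1 | 1 | 0 | 2 | 2 | 6 |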
s_v <- get_sentences(speech[3:24])
s_v_sentiment <- get_sentiment(s_v)
s_v_sentiment
##  [1]  0.50  0.00  3.95  5.05  2.75  1.25  1.00  2.30  0.50  1.85  1.00  1.00
## [13]  4.30  1.50  1.60  1.65  0.25  0.75  0.50  0.00  0.85  1.60  0.25  1.15
## [25]  0.35 -0.50  2.05  1.65  0.95  1.30  0.75  1.80  0.40  0.50  1.60  1.40
## [37]  2.40  2.05  1.60  0.75  0.00  0.80  1.80  2.00  1.00  0.00  2.25  0.00
## [49]  1.50  1.25  5.10  1.00  1.50  2.10  3.30  0.80  3.15  0.00  0.00  1.20
## [61]  0.25 -0.50  0.00  0.80  3.00

Loughran

Loughran is an English sentiment lexicon created for use with financial documents. It labels words with six possible sentiments that are important in financial contexts: “negative”, “positive”, “litigious”, “uncertainty”, “constraining”, or “superfluous”.

knitr::kable(get_sentiments("loughran") %>% count(sentiment, sort = TRUE) )
| sentiment | n |
|---|---|
| negative | 2355 |
| litigious | 904 |
| positive | 354 |
| uncertainty | 297 |
| constraining | 184 |
| superfluous | 56 |

The speech itself is contained in elements 3 through 24 of the scraped paragraphs, so I slice those elements and save them to tidy_speech. To extract the individual words, I split each paragraph on spaces and convert the result into a data frame in which each row is a single word of the speech.

tidy_speech <- speech[3:24]

# Split each paragraph on single spaces and flatten to one character vector
tidy_speech_words <- unlist(strsplit(tidy_speech, " "))
# One row per word, numbered in order of appearance
rowNumber <- seq_along(tidy_speech_words)
words.df <- data.frame(rowNumber, word = tidy_speech_words)
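
One caveat with splitting on a single space is that capitalization and punctuation stay attached to the tokens (for example "Court," rather than "court"). The lexicon entries printed earlier are lowercase and unpunctuated, so such tokens would not match in a join. The base R sketch below shows one possible cleanup. The helper clean_tokens and the example sentence are hypothetical and not part of the analysis above.

```r
# Hypothetical helper: lowercase the text, replace everything except
# letters, straight apostrophes and spaces with a space, then split
# on runs of whitespace and drop empty tokens
clean_tokens <- function(text) {
  text <- tolower(text)
  text <- gsub("[^a-z' ]+", " ", text)
  tokens <- unlist(strsplit(text, "\\s+"))
  tokens[tokens != ""]
}

# Made-up sentence used only to illustrate the helper
example_sentence <- "Thank you, Chairman. I am humbled and honored!"
clean_tokens(example_sentence)
```

This simple pattern also splits words at curly apostrophes and accented letters, so the unnest_tokens function used in the textbook example is probably the more robust choice for real text.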

Just like the textbook example, I inner join the words of the speech with the Loughran lexicon.

speech_sentiment_loughran <- words.df %>% inner_join(get_sentiments("loughran"))
## Joining, by = "word"

To count the number of words per sentiment type, I again followed the textbook: inner join the words from the speech with the lexicon, then count the words by sentiment and sort the result. Next, I graphed it using ggplot.

lang_word_counts <- words.df %>%
  inner_join(get_sentiments("loughran")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
lang_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>% 
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)

The plot shows that uncertainty was the most frequent sentiment type in the speech, followed by positive. Loughran also identified some litigious words, which is expected in a speech by a Supreme Court nominee.

Conclusion

When using lexicons, it is important to note how binary the scoring of words is and what kind of text the lexicon was built from. Loughran, for example, was based on and is mainly used for financial documents. The textbook shows how different lexicons can distribute sentiment differently. A possible extension for the new corpus is to compare its sentiment across several lexicons, as the textbook did. Context also matters, since it can change the meaning of a word compared with looking at the word in isolation.

References

Silge, Julia, and David Robinson. “Text Mining with R: A Tidy Approach”. O’Reilly Media, 2017.