Overview

In this assignment, I worked through the primary example code from Chapter 2 of the textbook in this R Markdown document. In that chapter, the authors use three sentiment lexicons: AFINN, Bing, and NRC. We were then asked to work with a different corpus of our choosing and to incorporate at least one additional sentiment lexicon. I chose Supreme Court Justice Ketanji Brown Jackson’s opening remarks and analyzed their sentiment using the Syuzhet package and the Loughran lexicon.

Textbook example

This section briefly walks through part of the example given in the textbook.

Sentiment Lexicons

AFINN

AFINN assigns each word an integer score from -5 (most negative) to 5 (most positive).

#Source: "Text Mining with R: A Tidy Approach" Chapter 2
knitr::kable(head(get_sentiments("afinn")))

word	value
abandon	-2
abandoned	-2
abandons	-2
abducted	-2
abduction	-2
abductions	-2
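
Because AFINN attaches a number to each word, a passage can be scored by summing the values of the words that appear in the lexicon. The following is a minimal base R sketch under stated assumptions: the two negative entries come from the table above, while the entry for "happy", its score, and the example sentence are hypothetical and used only for illustration.

```r
# Toy AFINN-style lexicon. "abandon" and "abandoned" appear in the table
# above; the entry for "happy" is an assumed value for illustration only.
toy_afinn <- data.frame(
  word  = c("abandon", "abandoned", "happy"),
  value = c(-2, -2, 3)
)

# Hypothetical tokenized sentence, one lower-case word per element.
tokens <- c("they", "abandoned", "the", "happy", "plan")

# Look up each token; words missing from the lexicon come back as NA.
token_values <- toy_afinn$value[match(tokens, toy_afinn$word)]

# Sum the matched values, ignoring words the lexicon does not cover.
passage_score <- sum(token_values, na.rm = TRUE)
passage_score
```

Words that are not in the lexicon contribute nothing to the score, which is why the coverage of a lexicon matters as much as its values.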

Bing

The Bing lexicon sorts words into two categories: positive and negative.

#Source: "Text Mining with R: A Tidy Approach" Chapter 2
knitr::kable(head(get_sentiments("bing")))

word	sentiment
2-faces	negative
abnormal	negative
abolish	negative
abominable	negative
abominably	negative
abominate	negative

NRC

The NRC lexicon categorizes words in a binary (yes/no) fashion into categories such as positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. A single word can belong to more than one category.

#Source: "Text Mining with R: A Tidy Approach" Chapter 2

knitr::kable(head(get_sentiments("nrc")))

word	sentiment
abacus	trust
abandon	fear
abandon	negative
abandon	sadness
abandoned	anger
abandoned	fear

In the textbook example, the code finds the sentiment in Jane Austen’s books. Once the text is in tidy format (one word per row), it can be inner joined with a specific sentiment lexicon.

#Source: "Text Mining with R: A Tidy Approach" Chapter 2

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text,
                                regex("^chapter [\\divxlc]",
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

knitr::kable(head(tidy_books))

book	linenumber	word
Sense & Sensibility	1	sense
Sense & Sensibility	1	and
Sense & Sensibility	1	sensibility
Sense & Sensibility	3	by
Sense & Sensibility	3	jane
Sense & Sensibility	3	austen
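
The chapter column in the code above relies on a small trick: a logical vector that is TRUE on every chapter heading is turned into a running chapter number with cumsum. Below is a base R sketch of the same idea, assuming a handful of made-up lines instead of the Austen text. It uses grepl with perl = TRUE as a stand-in for stringr's str_detect and regex.

```r
# Hypothetical lines standing in for the raw text of a book.
text <- c("A BOOK TITLE", "", "CHAPTER 1", "Some opening text.",
          "CHAPTER 2", "Some later text.")

# TRUE on lines that start with "chapter" followed by a digit or one of the
# Roman-numeral letters i, v, x, l, c; the match is case insensitive.
is_heading <- grepl("^chapter [\\divxlc]", text,
                    ignore.case = TRUE, perl = TRUE)

# cumsum() treats TRUE as 1, so the counter increases at each heading and
# every following line inherits the current chapter number.
chapter <- cumsum(is_heading)

data.frame(linenumber = seq_along(text), chapter = chapter, text = text)
```

Lines before the first heading receive chapter 0, which makes front matter easy to filter out later.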

Next, the textbook uses the NRC lexicon to perform sentiment analysis and filters it to find the joy words. Below, tidy_books is filtered to the book Emma and joined with those words, which gives a count of each word in Emma that NRC defines as joy.

#Source: "Text Mining with R: A Tidy Approach" Chapter 2

nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")

tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)

## Joining, by = "word"

## # A tibble: 301 × 2
##    word          n
##    <chr>     <int>
##  1 good        359
##  2 friend      166
##  3 hope        143
##  4 happy       125
##  5 love        117
##  6 deal         92
##  7 found        92
##  8 present      89
##  9 kind         82
## 10 happiness    76
## # … with 291 more rows

knitr::kable(head(nrc_joy))

word	sentiment
absolution	joy
abundance	joy
abundant	joy
accolade	joy
accompaniment	joy
accomplish	joy
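
The inner join used above keeps only the words that appear in both the text and the lexicon, and it repeats a word once for every category the lexicon assigns to it. Here is a small sketch with dplyr, assuming a made-up four-word text and four lexicon rows copied from the NRC table shown earlier; the names text_words and toy_nrc are hypothetical.

```r
library(dplyr)

# One row per word, as unnest_tokens() would produce (hypothetical text).
text_words <- tibble(
  linenumber = c(1, 1, 1, 2),
  word       = c("they", "abandon", "the", "abacus")
)

# Four rows copied from the NRC table shown earlier in this document.
toy_nrc <- tibble(
  word      = c("abacus", "abandon", "abandon", "abandon"),
  sentiment = c("trust", "fear", "negative", "sadness")
)

# Words without a lexicon entry ("they", "the") are dropped, and "abandon"
# appears three times, once per sentiment category.
joined <- inner_join(text_words, toy_nrc, by = "word")
joined
```

Naming the join column with by = "word" also avoids the "Joining, by" message that appears in the output above.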

#Source: "Text Mining with R: A Tidy Approach" Chapter 2


jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)

## Joining, by = "word"

ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")
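
The expression linenumber %/% 80 in the code above uses integer division to place every 80 lines of text into one section, and the net sentiment of a section is its positive word count minus its negative word count. The following sketch reproduces those steps on a made-up set of seven already-matched words; the line numbers and labels are assumptions for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical words that already carry a Bing-style label.
toy_matches <- tibble(
  linenumber = c(1, 5, 79, 80, 100, 159, 160),
  sentiment  = c("positive", "negative", "positive", "negative",
                 "negative", "positive", "positive")
)

# Lines 0-79 fall in section 0, lines 80-159 in section 1, and so on.
section_sentiment <- toy_matches %>%
  count(index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)

section_sentiment
```

The values_fill = 0 argument matters here: the last section has no negative words, and without the fill its net sentiment would be NA instead of a number.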

Comparing the three sentiment dictionaries

#Source: "Text Mining with R: A Tidy Approach" Chapter 2
pride_prejudice <- tidy_books %>% 
  filter(book == "Pride & Prejudice")

knitr::kable(head(pride_prejudice))

book	linenumber	word
Pride & Prejudice	1	pride
Pride & Prejudice	1	and
Pride & Prejudice	1	prejudice
Pride & Prejudice	3	by
Pride & Prejudice	3	jane
Pride & Prejudice	3	austen

#Source: "Text Mining with R: A Tidy Approach" Chapter 2

afinn <- pride_prejudice %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = linenumber %/% 80) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(method = "AFINN")

## Joining, by = "word"

bing_and_nrc <- bind_rows(
  pride_prejudice %>% 
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  pride_prejudice %>% 
    inner_join(get_sentiments("nrc") %>% 
                 filter(sentiment %in% c("positive", 
                                         "negative"))
    ) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>% 
  mutate(sentiment = positive - negative)

## Joining, by = "word"
## Joining, by = "word"

bind_rows(afinn, 
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")

#Source: "Text Mining with R: A Tidy Approach" Chapter 2
knitr::kable(get_sentiments("bing") %>% 
  count(sentiment))

sentiment	n
negative	4781
positive	2005

Most common positive and negative words

# Source: "Text Mining with R: A Tidy Approach" Chapter 2
bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

## Joining, by = "word"

knitr::kable(head(bing_word_counts))

word	sentiment	n
miss	negative	1855
well	positive	1523
good	positive	1380
great	positive	981
like	positive	725
better	positive	639

# Source: "Text Mining with R: A Tidy Approach" Chapter 2
bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>% 
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)

Sentiment on a new Corpus

Scrape website

To get the speech, I scraped the transcript of Judge Jackson’s opening remarks from the PBS website. I then saved the text of every paragraph on the page to the variable speech.

speech_website <- read_html("https://www.pbs.org/newshour/politics/read-the-full-text-of-supreme-court-nominee-ketanji-brown-jacksons-opening-remarks")
speech <- speech_website %>%
  html_nodes("p") %>%
  html_text()
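
Selecting every p element returns the page's navigation and footer paragraphs along with the speech, so the body has to be picked out by position afterwards. Below is a sketch of that step, assuming a small inline HTML string in place of the real page so that no network access is needed. It uses html_elements, the newer rvest name for html_nodes.

```r
library(rvest)

# Hypothetical stand-in for a downloaded page.
page <- read_html("<html><body>
  <p>Site navigation</p>
  <p>First paragraph of the speech.</p>
  <p>Second paragraph of the speech.</p>
  <p>Footer text</p>
</body></html>")

# Extract the text of every <p> element, in document order.
paragraphs <- html_text(html_elements(page, "p"))

# Keep only the elements that belong to the speech itself.
speech_body <- paragraphs[2:3]
speech_body
```

Because the positions depend on the page layout, printing the full vector once is a simple way to check which indices hold the body text.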

Syuzhet

Syuzhet is an R package that comes with its own sentiment lexicon. According to the package documentation, its get_nrc_sentiment function returns a data frame in which each row represents a sentence from the original file. The columns are: “anger”, “anticipation”, “disgust”, “fear”, “joy”, “sadness”, “surprise”, “trust”, “negative”, and “positive”.

Note that the Syuzhet function is get_sentiment, not get_sentiments as in tidytext. For each element of its input (here, each paragraph of the speech), it returns a numeric sentiment score, as shown in the output below.

get_sentiment(speech[3:24])

##  [1]  0.50  0.00  9.00  2.75  4.05  2.85  5.30  5.50  3.20  3.05  3.90  2.70
## [13]  9.05  2.60  3.00  5.00  6.10  5.40  3.15  1.20 -0.25  3.80

The get_nrc_sentiment function counts, for each element of the input, the words associated with eight different emotions and with negative and positive sentiment.

knitr::kable(get_nrc_sentiment(speech[3:24]))

anger	anticipation	disgust	fear	joy	sadness	surprise	trust	negative	positive
1	2	0	1	0	0	0	4	0	2
0	0	0	0	0	0	0	0	0	0
1	3	0	2	2	1	0	10	1	13
0	0	0	1	2	0	0	3	1	5
0	1	0	0	0	0	0	6	0	7
0	2	0	1	4	0	0	3	1	5
0	1	1	0	2	0	1	7	0	5
0	6	0	1	5	1	2	8	0	8
0	3	0	0	1	1	0	4	1	6
0	4	0	1	2	0	0	4	0	9
0	1	0	2	3	1	0	4	1	4
0	0	0	0	1	1	1	2	0	4
0	2	0	0	4	0	1	4	0	9
0	0	0	0	0	0	0	4	0	1
0	3	0	0	1	0	0	5	1	7
1	3	0	1	2	0	2	3	0	5
2	4	1	3	3	0	3	6	2	12
1	1	0	1	0	0	0	2	0	3
0	1	0	1	0	0	1	2	0	3
1	2	0	2	0	1	0	5	2	3
1	1	0	0	0	1	0	1	2	2
2	5	0	2	1	1	0	2	2	6

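To score individual sentences rather than whole paragraphs, get_sentences first splits the text into sentences, and get_sentiment is then applied to each one.

s_v <- get_sentences(speech[3:24])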
s_v_sentiment <- get_sentiment(s_v)
s_v_sentiment

##  [1]  0.50  0.00  3.95  5.05  2.75  1.25  1.00  2.30  0.50  1.85  1.00  1.00
## [13]  4.30  1.50  1.60  1.65  0.25  0.75  0.50  0.00  0.85  1.60  0.25  1.15
## [25]  0.35 -0.50  2.05  1.65  0.95  1.30  0.75  1.80  0.40  0.50  1.60  1.40
## [37]  2.40  2.05  1.60  0.75  0.00  0.80  1.80  2.00  1.00  0.00  2.25  0.00
## [49]  1.50  1.25  5.10  1.00  1.50  2.10  3.30  0.80  3.15  0.00  0.00  1.20
## [61]  0.25 -0.50  0.00  0.80  3.00
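
The two outputs above differ in length because get_sentiment returns one score per element it is given: 22 paragraphs in the first call, 65 sentences in the second. The sketch below shows that difference on a made-up two-sentence paragraph; the text is hypothetical, and no expected scores are shown because they depend on the lexicon.

```r
library(syuzhet)

# Hypothetical paragraph made of two sentences.
paragraph <- "I am happy and grateful. The delay was a terrible failure."

# Scoring the paragraph as a single element gives a single number.
paragraph_score <- get_sentiment(paragraph)

# Splitting first gives one score per sentence.
sentences <- get_sentences(paragraph)
sentence_scores <- get_sentiment(sentences)

length(paragraph_score)
length(sentence_scores)
```

Sentence-level scores make it possible to see where the tone shifts inside a paragraph, which a single paragraph total hides.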

Loughran

Loughran is an English sentiment lexicon created for use with financial documents. It labels words with six possible sentiments that are important in financial contexts: “negative”, “positive”, “litigious”, “uncertainty”, “constraining”, or “superfluous”.

knitr::kable(get_sentiments("loughran") %>% count(sentiment, sort = TRUE) )

sentiment	n
negative	2355
litigious	904
positive	354
uncertainty	297
constraining	184
superfluous	56

The speech itself is in elements 3 through 24 of the scraped text, so I slice those elements and save them to tidy_speech. To extract the individual words, I split each paragraph on spaces and convert the result into a data frame in which each row is a single word of the speech.

tidy_speech <- speech[3:24]

# Split each paragraph on spaces so that each element is a single word
tidy_speech_words <- unlist(strsplit(tidy_speech, " "))

# One row per word, numbered in the order the words appear in the speech
words.df <- data.frame(rowNumber = seq_along(tidy_speech_words),
                       word = tidy_speech_words)
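
Splitting on spaces alone keeps capital letters and attached punctuation, so a token such as "Court," would not match a lower-case lexicon entry in the join that follows. One possible alternative, sketched here in base R on a made-up sentence, is to lower-case the text and split on anything that is not a letter or an apostrophe. This is an illustration of the idea and not the tokenization used for the results in this document.

```r
# Hypothetical sentence with capital letters and punctuation.
sample_text <- "The Court decided the claim, fairly."

# Splitting on spaces keeps "The", "Court", "claim," and "fairly." as they are.
naive_tokens <- unlist(strsplit(sample_text, " "))

# Lower-case first, then split on runs of characters that are neither
# letters nor apostrophes, and drop any empty strings left over.
clean_tokens <- unlist(strsplit(tolower(sample_text), "[^a-z']+"))
clean_tokens <- clean_tokens[clean_tokens != ""]

naive_tokens
clean_tokens
```

The unnest_tokens function used in the textbook example performs a similar normalization (lower-casing and punctuation removal) by default.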

Just like the textbook example, I inner join the words of the speech with the Loughran lexicon.

speech_sentiment_loughran <- words.df %>% inner_join(get_sentiments("loughran"))

## Joining, by = "word"

To count the number of words per sentiment type, I again followed the textbook: inner join the words from the speech with the lexicon, then count the words by sentiment and sort the result. Next, I graphed it using ggplot.

lang_word_counts <- words.df %>%
  inner_join(get_sentiments("loughran")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

## Joining, by = "word"

lang_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>% 
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)

Here you can see that uncertainty was the largest sentiment type in the speech, followed by positive. The Loughran lexicon also identified some litigious words, which is expected in a speech by a Supreme Court nominee.
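
The plot shows counts per word, so a ranking of the sentiment types themselves needs the totals per category. Below is a sketch of how that could be computed with dplyr, assuming a made-up word list and a toy lexicon; the category assignments are hypothetical and are not copied from Loughran.

```r
library(dplyr)

# Hypothetical words from a speech, one per row.
toy_words <- tibble(word = c("may", "may", "court", "claim", "good", "could"))

# Toy lexicon; these category assignments are assumed for illustration only.
toy_lexicon <- tibble(
  word      = c("may", "could", "court", "claim", "good"),
  sentiment = c("uncertainty", "uncertainty", "litigious", "litigious",
                "positive")
)

# Count matched words per sentiment type, largest category first.
sentiment_totals <- toy_words %>%
  inner_join(toy_lexicon, by = "word") %>%
  count(sentiment, sort = TRUE)

sentiment_totals
```

Applying the same count to the joined speech data would give one number per category to back up the ranking read from the plot.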

Conclusion

When using lexicons, it is important to note how binary the scoring of words is and what kind of text the lexicon was built from. Loughran, for example, was based on, and is mainly used for, financial documents. The textbook shows how different lexicons can distribute sentiments differently. A possible extension for the new corpus is to compare its sentiment across several lexicons, as the textbook did. Context also matters, since it can change the meaning of words compared with looking at them in isolation.

References

Silge, Julia, and David Robinson. “Text Mining with R: A Tidy Approach.” O’Reilly Media, 2017.

Homework 10

Moiya Josephs

4/7/2022