Data Mining

Instructions

In Text Mining with R, Chapter 2 looks at Sentiment Analysis. In this assignment, we’ll start by getting the primary example code from chapter 2 working in an R Markdown document. Then, we’ll work with a different corpus and incorporate at least one additional sentiment lexicon.

References

  • Hu, M., & Liu, B. (2004). Mining and Summarizing Customer Reviews. Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 168-177.

  • Loughran, T., & McDonald, B. (2011). When is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks. The Journal of Finance, 66(1), 35-65.

  • Mohammad, S. M., & Turney, P. D. (2013). Crowdsourcing a Word-Emotion Association Lexicon. Computational Intelligence, 29(3), 436-465.

  • Nielsen, F. Å. (2011). A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. Proceedings of the ESWC2011 Workshop on ‘Making Sense of Microposts’, 93-98.

  • Silge, J., & Robinson, D. (2017). Chapters 1 & 2. In Text Mining with R: A Tidy Approach. O’Reilly Media.

Primary Example Code

Text Mining with R (Chapter 1 & 2)

The following section reproduces example code from the book Text Mining with R, which helps us understand the lexicons used for sentiment analysis and how the code can be extended from there.

Tidying the works of Jane Austen

library(janeaustenr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)

original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, 
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup()

Restructuring the text into the one-token-per-row format.

library(tidytext)
tidy_books <- original_books %>%
  unnest_tokens(word, text)

Remove stop words (kept in the tidytext dataset stop_words) with an anti_join().

data(stop_words)

tidy_books <- tidy_books %>%
  anti_join(stop_words)
## Joining with `by = join_by(word)`

Counting the most common words in Jane Austen’s books

tidy_books %>%
  count(word, sort = TRUE) 
## # A tibble: 13,914 × 2
##    word       n
##    <chr>  <int>
##  1 miss    1855
##  2 time    1337
##  3 fanny    862
##  4 dear     822
##  5 lady     817
##  6 sir      806
##  7 day      797
##  8 emma     787
##  9 sister   727
## 10 house    699
## # ℹ 13,904 more rows

Visualizing the previous count

library(ggplot2)

tidy_books %>%
  count(word, sort = TRUE) %>%
  filter(n > 600) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word)) +
  geom_col() +
  labs(y = NULL)

Adding more books using the gutenbergr package

library(gutenbergr)

hgwells <- gutenberg_download(c(35, 36, 5230, 159))
## Determining mirror for Project Gutenberg from https://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
## Warning: ! Could not download a book at http://aleph.gutenberg.org/1/5/159/159.zip.
## ℹ The book may have been archived.
## ℹ Alternatively, You may need to select a different mirror.
## → See https://www.gutenberg.org/MIRRORS.ALL for options.
tidy_hgwells <- hgwells %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
## Joining with `by = join_by(word)`
tidy_hgwells %>%
  count(word, sort = TRUE)
## # A tibble: 10,320 × 2
##    word          n
##    <chr>     <int>
##  1 time        396
##  2 people      249
##  3 door        224
##  4 kemp        213
##  5 invisible   197
##  6 black       178
##  7 stood       174
##  8 night       168
##  9 heard       167
## 10 hall        165
## # ℹ 10,310 more rows
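
The warning above shows that one of the four Wells titles (ID 159) failed to download from the selected mirror, so tidy_hgwells covers only three books. One way to retry is to pass an explicit mirror to gutenberg_download(); this is a sketch, and the mirror URL is just an example taken from the MIRRORS.ALL list mentioned in the warning:

# Retry the failed download against an explicitly chosen mirror
# (example mirror; see https://www.gutenberg.org/MIRRORS.ALL for options)
wells_159 <- gutenberg_download(159, mirror = "https://gutenberg.pglaf.org/")
hgwells <- bind_rows(hgwells, wells_159)

After rebuilding hgwells this way, the tokenizing step above would need to be re-run.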

And more books, this time by the Brontë sisters

bronte <- gutenberg_download(c(1260, 768, 969, 9182, 767))

tidy_bronte <- bronte %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
## Joining with `by = join_by(word)`
tidy_bronte %>%
  count(word, sort = TRUE)
## # A tibble: 23,213 × 2
##    word       n
##    <chr>  <int>
##  1 time    1065
##  2 miss     854
##  3 day      825
##  4 don’t    780
##  5 hand     767
##  6 eyes     714
##  7 night    648
##  8 heart    638
##  9 looked   601
## 10 door     591
## # ℹ 23,203 more rows

Calculating the frequency of each word across the works of Jane Austen, the Brontë sisters, and H.G. Wells by binding the data frames together.

library(tidyr)

frequency <- bind_rows(mutate(tidy_bronte, author = "Brontë Sisters"),
                       mutate(tidy_hgwells, author = "H.G. Wells"), 
                       mutate(tidy_books, author = "Jane Austen")) %>% 
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>% 
  select(-n) %>% 
  pivot_wider(names_from = author, values_from = proportion) %>%
  pivot_longer(`Brontë Sisters`:`H.G. Wells`,
               names_to = "author", values_to = "proportion")

frequency
## # A tibble: 55,526 × 4
##    word      `Jane Austen` author          proportion
##    <chr>             <dbl> <chr>                <dbl>
##  1 a            0.00000919 Brontë Sisters  0.0000665 
##  2 a            0.00000919 H.G. Wells      0.0000191 
##  3 aback       NA          Brontë Sisters  0.00000391
##  4 aback       NA          H.G. Wells      0.0000191 
##  5 abaht       NA          Brontë Sisters  0.00000391
##  6 abaht       NA          H.G. Wells     NA         
##  7 abandon     NA          Brontë Sisters  0.0000313 
##  8 abandon     NA          H.G. Wells      0.0000191 
##  9 abandoned    0.00000460 Brontë Sisters  0.0000899 
## 10 abandoned    0.00000460 H.G. Wells      0.000191  
## # ℹ 55,516 more rows

Quantifying how similar and different these sets of word frequencies are using a correlation test

cor.test(data = frequency[frequency$author == "Brontë Sisters",],
         ~ proportion + `Jane Austen`)
## 
##  Pearson's product-moment correlation
## 
## data:  proportion and Jane Austen
## t = 110.73, df = 10275, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7286645 0.7462983
## sample estimates:
##       cor 
## 0.7376071
cor.test(data = frequency[frequency$author == "H.G. Wells",], 
         ~ proportion + `Jane Austen`)
## 
##  Pearson's product-moment correlation
## 
## data:  proportion and Jane Austen
## t = 33.563, df = 5457, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3914204 0.4354038
## sample estimates:
##       cor 
## 0.4136534

The word frequencies are noticeably more correlated between the Austen and Brontë novels (about 0.74) than between Austen and H.G. Wells (about 0.41). The function get_sentiments() allows us to get specific sentiment lexicons with the appropriate measures for each one.

library(textdata)
get_sentiments("afinn")
## # A tibble: 2,477 × 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ℹ 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # ℹ 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,872 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ℹ 13,862 more rows

“Let’s look at the words with a joy score from the NRC lexicon. What are the most common joy words in Emma? First, we need to take the text of the novels and convert the text to the tidy format using unnest_tokens(), just as we did in Section 1.3. Let’s also set up some other columns to keep track of which line and chapter of the book each word comes from; we use group_by and mutate to construct those columns.”

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, 
                                regex("^chapter [\\divxlc]", 
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

“Now that the text is in a tidy format with one word per row, we are ready to do the sentiment analysis. First, let’s use the NRC lexicon and filter() for the joy words. Next, let’s filter() the data frame with the text from the books for the words from Emma and then use inner_join() to perform the sentiment analysis. What are the most common joy words in Emma? Let’s use count() from dplyr.”

nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## Joining with `by = join_by(word)`
## # A tibble: 301 × 2
##    word          n
##    <chr>     <int>
##  1 good        359
##  2 friend      166
##  3 hope        143
##  4 happy       125
##  5 love        117
##  6 deal         92
##  7 found        92
##  8 present      89
##  9 kind         82
## 10 happiness    76
## # ℹ 291 more rows

“We then use pivot_wider() so that we have negative and positive sentiment in separate columns, and lastly calculate a net sentiment (positive - negative).” “Now we can plot these sentiment scores across the plot trajectory of each novel.”

jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")
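
The many-to-many warning above is triggered because a few words appear more than once in the Bing lexicon; the warning’s row references point at a duplicated lexicon entry rather than at a problem in the novels. A minimal sketch of one way to handle it deliberately, assuming we are content to keep the first entry per word:

# Deduplicate the Bing lexicon so each word maps to a single sentiment;
# keeping the first entry per word is an arbitrary but explicit choice
bing_dedup <- get_sentiments("bing") %>%
  distinct(word, .keep_all = TRUE)

jane_austen_sentiment_dedup <- tidy_books %>%
  inner_join(bing_dedup, by = "word") %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)

Alternatively, passing relationship = "many-to-many" to inner_join() acknowledges the duplication and silences the warning without dropping any rows.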

Comparing the three sentiment dictionaries

pride_prejudice <- tidy_books %>% 
  filter(book == "Pride & Prejudice")

afinn <- pride_prejudice %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = linenumber %/% 80) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(method = "AFINN")
## Joining with `by = join_by(word)`
bing_and_nrc <- bind_rows(
  pride_prejudice %>% 
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  pride_prejudice %>% 
    inner_join(get_sentiments("nrc") %>% 
                 filter(sentiment %in% c("positive", 
                                         "negative"))
    ) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("nrc") %>% filter(sentiment %in% : Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 215 of `x` matches multiple rows in `y`.
## ℹ Row 5178 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
bind_rows(afinn, 
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")

“Let’s look briefly at how many positive and negative words are in these lexicons.”

get_sentiments("nrc") %>% 
  filter(sentiment %in% c("positive", "negative")) %>% 
  count(sentiment)
## # A tibble: 2 × 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   3316
## 2 positive   2308
get_sentiments("bing") %>% 
  count(sentiment)
## # A tibble: 2 × 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   4781
## 2 positive   2005

“By implementing count() here with arguments of both word and sentiment, we find out how much each word contributed to each sentiment.”

bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>% 
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)

“Let’s look at the most common words in Jane Austen’s works as a whole again, but this time as a wordcloud in Figure 2.5.”

library(wordcloud)
## Loading required package: RColorBrewer
tidy_books %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
## Joining with `by = join_by(word)`

“Let’s do the sentiment analysis to tag positive and negative words using an inner join, then find the most common positive and negative words. Until the step where we need to send the data to comparison.cloud(), this can all be done with joins, piping, and dplyr because our data is in tidy format.”

library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.

“First, let’s get the list of negative words from the Bing lexicon. Second, let’s make a data frame of how many words are in each chapter so we can normalize for the length of chapters. Then, let’s find the number of negative words in each chapter and divide by the total words in each chapter. For each book, which chapter has the highest proportion of negative words?”

p_and_p_sentences <- tibble(text = prideprejudice) %>% 
  unnest_tokens(sentence, text, token = "sentences")

austen_chapters <- austen_books() %>%
  group_by(book) %>%
  unnest_tokens(chapter, text, token = "regex", 
                pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
  ungroup()

austen_chapters %>% 
  group_by(book) %>% 
  summarise(chapters = n())
## # A tibble: 6 × 2
##   book                chapters
##   <fct>                  <int>
## 1 Sense & Sensibility       51
## 2 Pride & Prejudice         62
## 3 Mansfield Park            49
## 4 Emma                      56
## 5 Northanger Abbey          32
## 6 Persuasion                25
bingnegative <- get_sentiments("bing") %>% 
  filter(sentiment == "negative")

wordcounts <- tidy_books %>%
  group_by(book, chapter) %>%
  summarize(words = n())
## `summarise()` has grouped output by 'book'. You can override using the
## `.groups` argument.
tidy_books %>%
  semi_join(bingnegative) %>%
  group_by(book, chapter) %>%
  summarize(negativewords = n()) %>%
  left_join(wordcounts, by = c("book", "chapter")) %>%
  mutate(ratio = negativewords/words) %>%
  filter(chapter != 0) %>%
  slice_max(ratio, n = 1) %>% 
  ungroup()
## Joining with `by = join_by(word)`
## `summarise()` has grouped output by 'book'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 5
##   book                chapter negativewords words  ratio
##   <fct>                 <int>         <int> <int>  <dbl>
## 1 Sense & Sensibility      43           161  3405 0.0473
## 2 Pride & Prejudice        34           111  2104 0.0528
## 3 Mansfield Park           46           173  3685 0.0469
## 4 Emma                     15           151  3340 0.0452
## 5 Northanger Abbey         21           149  2982 0.0500
## 6 Persuasion                4            62  1807 0.0343

Extending the code

Now that we have reviewed the lexicons covered in Chapter 2, we can bring in our own corpus and an additional lexicon for a brief analysis.

For this assignment we will use two works by Sigmund Freud, identified below by their Project Gutenberg IDs:

sfreud <- gutenberg_download(c(38219, 15489))

# Tidy the text data
tidy_sfreud <- sfreud %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
## Joining with `by = join_by(word)`
# Count the most frequent words in the tidy data
tidy_sfreud %>%
  count(word, sort = TRUE)
## # A tibble: 11,489 × 2
##    word            n
##    <chr>       <int>
##  1 dream        1955
##  2 dreams        563
##  3 sexual        470
##  4 psychic       416
##  5 life          375
##  6 time          350
##  7 unconscious   327
##  8 libido        267
##  9 analysis      255
## 10 patient       251
## # ℹ 11,479 more rows

We will compare the sentiment results from the NRC and Loughran-McDonald lexicons. The Loughran-McDonald lexicon, developed for financial texts such as 10-K filings, categorizes words into six sentiment classes: positive, negative, uncertainty, litigious, constraining, and superfluous.
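
As a quick sanity check (a sketch; exact counts can vary with the installed textdata version), we can tabulate how many words fall into each of those classes:

# Tabulate the Loughran-McDonald sentiment classes by word count
get_sentiments("loughran") %>%
  count(sentiment, sort = TRUE)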

# Perform sentiment analysis using the NRC Emotion Lexicon
nrc_sentiment <- tidy_sfreud %>%
  inner_join(get_sentiments("nrc")) %>%
  count(sentiment, sort = TRUE) %>%
  mutate(lexicon = "NRC")
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("nrc")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 17 of `x` matches multiple rows in `y`.
## ℹ Row 8058 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
# Perform sentiment analysis using the Loughran-McDonald Lexicon
loughran_sentiment <- tidy_sfreud %>%
  inner_join(get_sentiments("loughran")) %>%
  count(sentiment, sort = TRUE) %>%
  mutate(lexicon = "Loughran-McDonald")
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("loughran")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 548 of `x` matches multiple rows in `y`.
## ℹ Row 2462 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
# Combine the results for comparison
combined_sentiment <- bind_rows(nrc_sentiment, loughran_sentiment)

# Visualize the comparison
ggplot(combined_sentiment, aes(x = reorder(sentiment, n), y = n, fill = lexicon)) +
  geom_col(position = "dodge") +
  coord_flip() +
  labs(title = "Comparison of Sentiment Analysis Using NRC and Loughran-McDonald Lexicons",
       x = "Sentiment",
       y = "Count",
       fill = "Lexicon") +
  theme_minimal()

- Both the NRC and Loughran-McDonald lexicons match substantial numbers of negative and positive words. This suggests that Freud’s writing blends emotional language, oscillating between optimism and pessimism, which is consistent with psychoanalytic discussions of human conflict and internal struggle (a quick numeric check of this balance follows below).
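
As a rough numeric check of that claim, we can compute the share of positive versus negative matches within each lexicon, reusing the combined_sentiment table built above (a sketch, restricted to the two classes the lexicons share):

# Share of positive vs. negative matches within each lexicon
combined_sentiment %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  group_by(lexicon) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()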

Let’s now see how the emotional arc of Freud’s books changes over the course of the text. We can use the NRC lexicon for this, tracking how the emotional tone shifts throughout.

# Calculate the emotional arc using the NRC lexicon, binning the matched
# words into 500-word chunks so each index holds comparable counts
# (counting by a bare row_number() would give every index a count of 1)
emotional_arc <- tidy_sfreud %>%
  inner_join(get_sentiments("nrc")) %>%
  mutate(index = row_number() %/% 500) %>%
  count(index, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0)
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("nrc")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 17 of `x` matches multiple rows in `y`.
## ℹ Row 8058 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
# Plot the emotional arc
ggplot(emotional_arc, aes(x = index)) +
  geom_line(aes(y = anger, color = "Anger")) +
  geom_line(aes(y = joy, color = "Joy")) +
  geom_line(aes(y = fear, color = "Fear")) +
  geom_line(aes(y = sadness, color = "Sadness")) +
  labs(title = "Emotional Arc of Freud's Books",
       x = "Narrative Time",
       y = "Emotion Count") +
  theme_minimal()

- The emotional arc is heavily dominated by the “sadness” category, which is expected in a text dealing with complex psychological issues, trauma, and the unconscious mind. The high frequency of sadness may reflect Freud’s emphasis on the darker aspects of the human psyche; the smoothing sketch below makes the trend easier to read.
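
The raw per-chunk counts are noisy, so the dominance of sadness is easier to confirm after smoothing. A minimal sketch using a centered moving average from base R’s stats::filter() (namespaced to avoid dplyr’s filter()); the 5-chunk window is an arbitrary choice:

# Smooth the sadness counts with a centered 5-chunk moving average;
# the first and last chunks become NA and are dropped from the line
emotional_arc %>%
  mutate(sadness_smooth = as.numeric(stats::filter(sadness, rep(1/5, 5)))) %>%
  ggplot(aes(index, sadness_smooth)) +
  geom_line(na.rm = TRUE) +
  labs(title = "Smoothed Sadness Counts Across Freud's Books",
       x = "Narrative time (500-word chunks)",
       y = "Smoothed count") +
  theme_minimal()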