For this assignment, we explore and build on the code presented in Chapter 2 of the web textbook Text Mining with R. The first part of this assignment is taken directly from the book's example code. From there, we work with a different corpus of our choosing and incorporate at least one additional sentiment lexicon discovered through research (potentially from another R package).
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(tidytext)
## Warning: package 'tidytext' was built under R version 4.2.3
library(textdata)
## Warning: package 'textdata' was built under R version 4.2.3
library(janeaustenr)
## Warning: package 'janeaustenr' was built under R version 4.2.3
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 4.2.3
## Loading required package: RColorBrewer
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.2.3
##
## Attaching package: 'reshape2'
##
## The following object is masked from 'package:tidyr':
##
## smiths
library(gutenbergr)
## Warning: package 'gutenbergr' was built under R version 4.2.3
library(openintro)
## Warning: package 'openintro' was built under R version 4.2.3
## Loading required package: airports
## Warning: package 'airports' was built under R version 4.2.3
## Loading required package: cherryblossom
## Warning: package 'cherryblossom' was built under R version 4.2.3
## Loading required package: usdata
## Warning: package 'usdata' was built under R version 4.2.3
##
## Attaching package: 'openintro'
##
## The following object is masked from 'package:reshape2':
##
## tips
Obtain sentiment lexicons from three different sources: AFINN, Bing, and NRC.
Note: if you initially encounter problems loading AFINN, Bing, or NRC, you will need to accept the lexicon's license by running the corresponding get_sentiments() call in the R console and agreeing to the download prompt.
afinn <- get_sentiments("afinn")
bing <- get_sentiments("bing")
nrc <- get_sentiments("nrc")
In the code below, we use the austen_books() function from the janeaustenr package to extract the text of Jane Austen's novels and prepare it for analysis by splitting it into individual words with the unnest_tokens() function.
tidy_books <- austen_books() %>%
group_by(book) %>%
mutate(
linenumber = row_number(),
chapter = cumsum(str_detect(text,
regex("^chapter [\\divxlc]",
ignore_case = TRUE)))) %>%
ungroup() %>%
unnest_tokens(word, text)
Next, we filter the NRC sentiment lexicon to include only words with
a “joy” sentiment, then use the inner_join() function to
merge this lexicon with the tidy text data frame. The resulting data
frame is then filtered to include only words from “Emma” and is counted
using count() to show the frequency of words with a “joy”
sentiment.
nrc_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
tidy_books %>%
filter(book == "Emma") %>%
inner_join(nrc_joy) %>%
count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 301 × 2
## word n
## <chr> <int>
## 1 good 359
## 2 friend 166
## 3 hope 143
## 4 happy 125
## 5 love 117
## 6 deal 92
## 7 found 92
## 8 present 89
## 9 kind 82
## 10 happiness 76
## # … with 291 more rows
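As a small extension of ours (not in the book's example), the same join generalizes to all six novels at once if we keep the book column in the count:
# joy-word frequencies across every Austen novel, not just Emma
tidy_books %>%
  inner_join(nrc_joy, by = "word") %>%
  count(book, word, sort = TRUE)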
Now, we join the tidy text data frame with the Bing sentiment lexicon
using inner_join(). We then use the count() and
pivot_wider() functions to count the number of positive and
negative words in each book, grouped by sections of 80 lines. Finally,
the ggplot() function is used to create bar charts that
show the sentiment score over the plot trajectory of each novel. The
chart is facet-wrapped by book, and the sentiment score is calculated as
the difference between the number of positive and negative words.
jane_austen_sentiment <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(book, index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining, by = "word"
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, ncol = 2, scales = "free_x")
Next, we compare the three lexicons on a single novel, Pride & Prejudice.
pride_prejudice <- tidy_books %>%
filter(book == "Pride & Prejudice")
afinn <- pride_prejudice %>%
inner_join(get_sentiments("afinn")) %>%
group_by(index = linenumber %/% 80) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN")
## Joining, by = "word"
bing_and_nrc <- bind_rows(
pride_prejudice %>%
inner_join(get_sentiments("bing")) %>%
mutate(method = "Bing et al."),
pride_prejudice %>%
inner_join(get_sentiments("nrc") %>%
filter(sentiment %in% c("positive",
"negative"))
) %>%
mutate(method = "NRC")) %>%
count(method, index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining, by = "word"
## Joining, by = "word"
bind_rows(afinn,
bing_and_nrc) %>%
ggplot(aes(index, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y")
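As a brief extension of ours (not in the book): since all three methods score the same 80-line blocks, we can quantify how closely they agree by correlating the per-block sentiment scores. Note that afinn here refers to the per-block score data frame built above, not the raw lexicon.
bind_rows(afinn, bing_and_nrc) %>%
  select(method, index, sentiment) %>%
  # one column of block scores per method
  pivot_wider(names_from = method, values_from = sentiment) %>%
  select(-index) %>%
  # pairwise correlations between the three methods' scores
  cor(use = "pairwise.complete.obs")
Next, we count the frequency of words in the Austen text, categorized by sentiment, using the Bing lexicon.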
bing_word_counts <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
## Joining, by = "word"
Next, we visualize the top 10 positive and negative words from the Bing lexicon in a bar plot.
bing_word_counts %>%
group_by(sentiment) %>%
slice_max(n, n = 10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(x = "Contribution to sentiment",
y = NULL)
Next, we create a custom list of stop words that includes "well", "", and "miss" by binding a tibble of these words to the standard stop_words list.
custom_stop_words <- bind_rows(tibble(word = c("well", "", "miss"),
lexicon = c("custom")),
stop_words)
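Although we define custom_stop_words here, it is worth seeing its effect: in Austen, "miss" is usually a title (as in "Miss Woodhouse") rather than a negative word, so the Bing lexicon over-counts it. A hedged usage sketch of ours:
bing_word_counts %>%
  # drop "well", "", and "miss" before ranking contributors
  anti_join(custom_stop_words, by = "word") %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup()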
My Bondage and My Freedom by Frederick Douglass
We will analyze My Bondage and My Freedom, an autobiographical slave narrative by Frederick Douglass. We looked up the book's ID number on Project Gutenberg and use the gutenbergr package to search for and download the text.
bondage_count <- gutenberg_download(202)
## Determining mirror for Project Gutenberg from https://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
bondage_count
## # A tibble: 12,324 × 2
## gutenberg_id text
## <int> <chr>
## 1 202 "MY BONDAGE and MY FREEDOM"
## 2 202 ""
## 3 202 "By Frederick Douglass"
## 4 202 ""
## 5 202 ""
## 6 202 "By a principle essential to Christianity, a PERSON is eternall…
## 7 202 "differenced from a THING; so that the idea of a HUMAN BEING,"
## 8 202 "necessarily excludes the idea of PROPERTY IN THAT BEING. —COLE…
## 9 202 ""
## 10 202 "Entered according to Act of Congress in 1855 by Frederick Doug…
## # … with 12,314 more rows
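Rather than hard-coding the cutoff row below, it could be located programmatically. A sketch of ours: list the lines that begin with the first chapter heading and pick the occurrence past the table of contents.
# row numbers of lines starting with "CHAPTER I" (heading and TOC entry)
which(str_detect(bondage_count$text, regex("^CHAPTER I\\b")))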
# remove the first 762 rows of text, which are front matter and the table of contents
bondage_count <- bondage_count[763:nrow(bondage_count), ]
# use unnest_tokens() to break each line into individual words, one word per row
bondage <- bondage_count %>% unnest_tokens(word, text)
bondage
## # A tibble: 129,096 × 2
## gutenberg_id word
## <int> <chr>
## 1 202 chapter
## 2 202 i
## 3 202 _childhood_
## 4 202 place
## 5 202 of
## 6 202 birth
## 7 202 character
## 8 202 of
## 9 202 the
## 10 202 district
## # … with 129,086 more rows
bondage_index <- bondage_count %>%
filter(text != "") %>%
mutate(linenumber = row_number(),
chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", ignore_case = TRUE))))
bondage_index
## # A tibble: 10,716 × 4
## gutenberg_id text linen…¹ chapter
## <int> <chr> <int> <int>
## 1 202 CHAPTER I. _Childhood_ 1 1
## 2 202 PLACE OF BIRTH—CHARACTER OF THE DISTRICT—TUCKAH… 2 1
## 3 202 NAME—CHOPTANK RIVER—TIME OF BIRTH—GENEALOGICAL … 3 1
## 4 202 TIME—NAMES OF GRANDPARENTS—THEIR POSITION—GRAND… 4 1
## 5 202 ESTEEMED—“BORN TO GOOD LUCK”—SWEET POTATOES—SUP… 5 1
## 6 202 CABIN—ITS CHARMS—SEPARATING CHILDREN—MY AUNTS—T… 6 1
## 7 202 KNOWLEDGE OF BEING A SLAVE—OLD MASTER—GRIEFS AN… 7 1
## 8 202 CHILDHOOD—COMPARATIVE HAPPINESS OF THE SLAVE-BO… 8 1
## 9 202 SLAVEHOLDER. 9 1
## 10 202 In Talbot county, Eastern Shore, Maryland, near… 10 1
## # … with 10,706 more rows, and abbreviated variable name ¹linenumber
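As a quick sanity check of ours on the chapter regex, we can count the lines detected per chapter; the number of groups should match the book's chapter headings.
# lines per detected chapter
bondage_index %>%
  count(chapter)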
bondage %>%
inner_join(get_sentiments("bing")) %>%
filter(sentiment == "positive") %>%
count(word, sentiment, sort = TRUE) %>%
top_n(10) %>%
mutate(word = reorder(word, desc(n))) %>%
ggplot() +
aes(x = word, y = n) +
labs(title = "Most Frequent Positive Words") +
ylab("Count") +
xlab("Word") +
geom_col() +
geom_text(aes(label = n, vjust = -.5)) +
theme(
panel.background = element_rect(fill = "white", color = NA),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.title = element_text(hjust = 0.5)
)
## Joining, by = "word"
## Selecting by n
bondage %>%
inner_join(get_sentiments("bing")) %>%
filter(sentiment == "negative") %>%
count(word, sentiment, sort = TRUE) %>%
top_n(10) %>%
mutate(word = reorder(word, desc(n))) %>%
ggplot() +
aes(x = word, y = n) +
labs(title = "Most Frequent Negative Words") +
ylab("Count") +
xlab("Word") +
geom_col() +
geom_text(aes(label = n, vjust = -.5)) +
theme(
panel.background = element_rect(fill = "white", color = NA),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.title = element_text(hjust = 0.5)
)
## Joining, by = "word"
## Selecting by n
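The positive and negative plots above differ only in the sentiment filter and the title, so a small helper function (a refactoring sketch of ours, not part of the assignment code) removes the duplication:
plot_top_words <- function(tokens, sent, title) {
  tokens %>%
    inner_join(get_sentiments("bing"), by = "word") %>%
    filter(sentiment == sent) %>%
    count(word, sort = TRUE) %>%
    slice_max(n, n = 10) %>%
    mutate(word = reorder(word, desc(n))) %>%
    ggplot(aes(x = word, y = n)) +
    geom_col() +
    geom_text(aes(label = n, vjust = -.5)) +
    labs(title = title, x = "Word", y = "Count") +
    theme(
      panel.background = element_rect(fill = "white", color = NA),
      axis.text.y = element_blank(),
      axis.ticks.y = element_blank(),
      plot.title = element_text(hjust = 0.5)
    )
}
# should reproduce the positive-word plot above
plot_top_words(bondage, "positive", "Most Frequent Positive Words")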
Let’s look at the most common words in Frederick Douglass’s book with a wordcloud.
library(RColorBrewer)
# Color palette for the wordclouds
colors <- brewer.pal(8, "Dark2")
# Wordcloud of non-stopwords
bondage %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100, colors = colors))
## Joining, by = "word"
Above are the most common words in Frederick Douglass’s autobiographical slave narrative.
# Sentiment analysis to tag positive and negative words using an inner join, then find the most common positive and negative words
bondage %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = colors,
max.words = 100)
## Joining, by = "word"
The size of a word’s text above is in proportion to its frequency within its sentiment. We can use this visualization to see the most important positive and negative words, but the sizes of the words are not comparable across sentiments.
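One way to make the two halves comparable (a hedged sketch of ours): scale each word's count by its sentiment's total, so sizes reflect within-sentiment proportions.
bondage %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%
  # each word's share of its sentiment's total occurrences
  mutate(prop = n / sum(n)) %>%
  slice_max(prop, n = 5)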
We will now use the Loughran-McDonald ("loughran") lexicon, which we found through our research.
Note: if you initially encounter problems loading loughran, you will need to accept the lexicon's license by running get_sentiments("loughran") in the R console and agreeing to the download prompt.
lghrn <- get_sentiments("loughran")
unique(lghrn$sentiment)
## [1] "negative" "positive" "uncertainty" "litigious" "constraining"
## [6] "superfluous"
# Let's explore the lexicon to see which words are tagged as litigious and constraining.
bondage_index %>%
unnest_tokens(word, text) %>%
inner_join(get_sentiments("loughran")) %>%
filter(sentiment %in% c("litigious", "constraining")) %>%
count(word, sentiment, sort = TRUE) %>%
group_by(sentiment) %>%
top_n(10) %>%
ggplot() +
aes(x = reorder(word,desc(n)), y = n) +
geom_col() +
facet_grid(~sentiment, scales = "free_x") +
geom_text(aes(label = n, vjust = -.5)) +
labs(title = "Words Associated with Litigious and Constraining") +
ylab("Count") +
xlab("Word") +
theme(
panel.background = element_rect(fill = "white", color = NA),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.title = element_text(hjust = 0.5)
)
## Joining, by = "word"
## Selecting by n
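To back up the comparison drawn in the conclusion below, a quick tally of ours of the total word occurrences per tag:
# total litigious vs. constraining word occurrences
bondage_index %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("loughran"), by = "word") %>%
  filter(sentiment %in% c("litigious", "constraining")) %>%
  count(sentiment)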
This assignment has allowed us to explore the topic of sentiment analysis. We implemented and expanded upon the main example code from Chapter 2 of Text Mining with R, using three sentiment lexicons (AFINN, Bing, and NRC) to analyze the sentiment of Jane Austen's novels. Then, using the gutenbergr package, we explored My Bondage and My Freedom by Frederick Douglass. We tidied the text into a one-token-per-row format with the unnest_tokens() function and used sentiment analysis with an inner join to find the most frequent positive and negative words. From our findings, the most frequent positive and negative words are "master" and "slave", respectively; both also appear among the most common words overall in the wordcloud.

Finally, we filtered the loughran sentiment lexicon to the "litigious" and "constraining" sentiments, joined it with the tokenized text of My Bondage and My Freedom, and counted word frequencies with count(). From here we can see that more words in the book are associated with the litigious sentiment than with the constraining one.