Section 1:

This section walks through the example from Chapter 2 of “Text Mining with R.”

In this chunk of code, the data frames for three sentiment lexicons (AFINN, Bing, and NRC) are displayed.

library(tidytext)
## Warning: package 'tidytext' was built under R version 4.3.2
get_sentiments("afinn")
## # A tibble: 2,477 × 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ℹ 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # ℹ 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,872 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ℹ 13,862 more rows

Below, you can see the code the book uses to convert the novels into a tidy, one-word-per-row format while tracking the line number and chapter where each word appears.

library(janeaustenr)
## Warning: package 'janeaustenr' was built under R version 4.3.2
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, 
                                regex("^chapter [\\divxlc]", 
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

In the code below, the NRC lexicon is filtered down to the words in the “joy” category, and those words are joined against the book “Emma” to see which joyful words appear most often in that book.

nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## Joining with `by = join_by(word)`
## # A tibble: 301 × 2
##    word          n
##    <chr>     <int>
##  1 good        359
##  2 friend      166
##  3 hope        143
##  4 happy       125
##  5 love        117
##  6 deal         92
##  7 found        92
##  8 present      89
##  9 kind         82
## 10 happiness    76
## # ℹ 291 more rows

When analyzing a long text, it helps to compute sentiment for smaller sections rather than for the book as a whole. The code below uses sections of 80 lines each, counts positive and negative words in every section, spreads those counts into separate columns, and calculates the net sentiment (positive minus negative).

library(tidyr)

jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
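
The warning above appears because at least one word occurs more than once in the Bing lexicon. If that many-to-many relationship is expected, it can be declared explicitly to silence the warning; a minimal sketch (assuming dplyr 1.1.0 or later) that gives the same result as the chunk above:

# Same join as above, but declaring the expected many-to-many relationship
# (dplyr >= 1.1.0) so the warning is not printed
jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing"), relationship = "many-to-many") %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)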

Below, you can see the code used to plot the sentiment scores calculated in the last chunk, with one panel per book.

library(ggplot2)

ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")

In the chunk below, the book “Pride & Prejudice” was pulled out of the original data frame.

pride_prejudice <- tidy_books %>% 
  filter(book == "Pride & Prejudice")

pride_prejudice
## # A tibble: 122,204 × 4
##    book              linenumber chapter word     
##    <fct>                  <int>   <int> <chr>    
##  1 Pride & Prejudice          1       0 pride    
##  2 Pride & Prejudice          1       0 and      
##  3 Pride & Prejudice          1       0 prejudice
##  4 Pride & Prejudice          3       0 by       
##  5 Pride & Prejudice          3       0 jane     
##  6 Pride & Prejudice          3       0 austen   
##  7 Pride & Prejudice          7       1 chapter  
##  8 Pride & Prejudice          7       1 1        
##  9 Pride & Prejudice         10       1 it       
## 10 Pride & Prejudice         10       1 is       
## # ℹ 122,194 more rows

In this chunk, each of the three lexicons is used to find the net sentiment in each 80-line section of “Pride & Prejudice.”

afinn <- pride_prejudice %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = linenumber %/% 80) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(method = "AFINN")
## Joining with `by = join_by(word)`
bing_and_nrc <- bind_rows(
  pride_prejudice %>% 
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  pride_prejudice %>% 
    inner_join(get_sentiments("nrc") %>% 
                 filter(sentiment %in% c("positive", 
                                         "negative"))
    ) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("nrc") %>% filter(sentiment %in% : Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 215 of `x` matches multiple rows in `y`.
## ℹ Row 5178 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.

The three sentiment lexicons were plotted below for comparison.

bind_rows(afinn, 
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")

The two counts below show the number of positive and negative words in the NRC and Bing lexicons. Both lexicons contain more negative than positive words, but the Bing lexicon has a higher proportion of negative words.

get_sentiments("nrc") %>% 
  filter(sentiment %in% c("positive", "negative")) %>% 
  count(sentiment)
## # A tibble: 2 × 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   3316
## 2 positive   2308
get_sentiments("bing") %>% 
  count(sentiment)
## # A tibble: 2 × 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   4781
## 2 positive   2005
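
Because the two lexicons have different proportions of negative words, their absolute sentiment values are not directly comparable. A minimal sketch that turns the counts above into proportions:

# Proportion of negative vs. positive entries in each lexicon
bind_rows(
  get_sentiments("nrc") %>%
    filter(sentiment %in% c("positive", "negative")) %>%
    count(sentiment) %>%
    mutate(lexicon = "NRC"),
  get_sentiments("bing") %>%
    count(sentiment) %>%
    mutate(lexicon = "Bing")
) %>%
  group_by(lexicon) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()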

In this chunk of code, you can see how to count how often each word appears with a positive or negative sentiment across all the books.

bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
bing_word_counts
## # A tibble: 2,585 × 3
##    word     sentiment     n
##    <chr>    <chr>     <int>
##  1 miss     negative   1855
##  2 well     positive   1523
##  3 good     positive   1380
##  4 great    positive    981
##  5 like     positive    725
##  6 better   positive    639
##  7 enough   positive    613
##  8 happy    positive    534
##  9 love     positive    495
## 10 pleasure positive    462
## # ℹ 2,575 more rows

Use the code below to plot how much each word contributes to positive and negative sentiment.

bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>% 
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)

Some words are assigned to the wrong sentiment for this context; for example, “miss” is coded as negative, but in Austen’s novels it is usually a title for a young woman. The code below shows how to add such a word to a custom stop-word list.

custom_stop_words <- bind_rows(tibble(word = c("miss"),  
                                      lexicon = c("custom")), 
                               stop_words)

custom_stop_words
## # A tibble: 1,150 × 2
##    word        lexicon
##    <chr>       <chr>  
##  1 miss        custom 
##  2 a           SMART  
##  3 a's         SMART  
##  4 able        SMART  
##  5 about       SMART  
##  6 above       SMART  
##  7 according   SMART  
##  8 accordingly SMART  
##  9 across      SMART  
## 10 actually    SMART  
## # ℹ 1,140 more rows
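
As a quick illustration (not part of the book’s original chunk), the custom list can be applied with an anti_join() so that “miss” no longer dominates the word counts:

# Sketch: drop the custom stop words (including "miss") before counting
tidy_books %>%
  anti_join(custom_stop_words, by = "word") %>%
  count(word, sort = TRUE)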

The wordcloud package can be used to create a cloud of the most common words after removing stop words.

library(wordcloud)
## Warning: package 'wordcloud' was built under R version 4.3.2
## Loading required package: RColorBrewer
tidy_books %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
## Joining with `by = join_by(word)`

Using the reshape2 package, the word counts can be reshaped into a matrix with acast() so that comparison.cloud() can contrast the most common positive and negative words in a single cloud.

library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.

This code shows how to tokenize the text into whole sentences instead of individual words.

p_and_p_sentences <- tibble(text = prideprejudice) %>% 
  unnest_tokens(sentence, text, token = "sentences")

p_and_p_sentences$sentence[2]
## [1] "by jane austen"

In the chunk of code below, a regex is used to split each novel into chapters, so each row of the resulting data frame holds the text of one chapter.

austen_chapters <- austen_books() %>%
  group_by(book) %>%
  unnest_tokens(chapter, text, token = "regex", 
                pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
  ungroup()

austen_chapters %>% 
  group_by(book) %>% 
  summarise(chapters = n())
## # A tibble: 6 × 2
##   book                chapters
##   <fct>                  <int>
## 1 Sense & Sensibility       51
## 2 Pride & Prejudice         62
## 3 Mansfield Park            49
## 4 Emma                      56
## 5 Northanger Abbey          32
## 6 Persuasion                25

You can use the code below to pull the negative words from the Bing lexicon and then, for each novel, find the chapter with the highest ratio of negative words to total words.

bingnegative <- get_sentiments("bing") %>% 
  filter(sentiment == "negative")

wordcounts <- tidy_books %>%
  group_by(book, chapter) %>%
  summarize(words = n())
## `summarise()` has grouped output by 'book'. You can override using the
## `.groups` argument.
tidy_books %>%
  semi_join(bingnegative) %>%
  group_by(book, chapter) %>%
  summarize(negativewords = n()) %>%
  left_join(wordcounts, by = c("book", "chapter")) %>%
  mutate(ratio = negativewords/words) %>%
  filter(chapter != 0) %>%
  slice_max(ratio, n = 1) %>% 
  ungroup()
## Joining with `by = join_by(word)`
## `summarise()` has grouped output by 'book'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 5
##   book                chapter negativewords words  ratio
##   <fct>                 <int>         <int> <int>  <dbl>
## 1 Sense & Sensibility      43           161  3405 0.0473
## 2 Pride & Prejudice        34           111  2104 0.0528
## 3 Mansfield Park           46           173  3685 0.0469
## 4 Emma                     15           151  3340 0.0452
## 5 Northanger Abbey         21           149  2982 0.0500
## 6 Persuasion                4            62  1807 0.0343

Section 2:

This section shows sentiment analysis of an article I chose.

The article, by Scott Berinato, is titled “Data Science and the Art of Persuasion.” It discusses how organizations struggle to communicate insights about the data they collect, explains why this happens, and suggests how it can be fixed.

Steps:

  1. I loaded all the libraries used in this section’s code. I included libraries that were already loaded in the top section so that this section can be reproduced on its own.
# Load libraries
library(readr)
library(tidyr)
library(stringr)
library(dplyr)
library(wordcloud)
library(tidytext)
library(ggplot2)
  2. I imported the article as a text file and then converted it into a data frame.
# Import the article as text and convert to data frame
article <- read_file("https://raw.githubusercontent.com/juliaDataScience-22/cuny-fall-23/manage-acquire-data/Data%20Science%20Article.txt")

article <- as.data.frame(article)
  3. I tidied the data in the data frame. Some characters and symbols were not helpful, so I removed them from the words. (An alternative approach using unnest_tokens() is sketched after this chunk.)
#Tidy the data and remove unwanted characters

article <- separate_longer_delim(article, article, delim = "\r\n")
article <- separate_longer_delim(article, article, delim = " ")
article <- separate_longer_delim(article, article, delim = "/")
article <- separate_longer_delim(article, article, delim = "-")
article <- separate_longer_delim(article, article, delim = "—")
article <- separate_longer_delim(article, article, delim = "….")

article[article == ''] <- NA
article <-
  article |> 
  na.omit()

article$article <- gsub("\\.$", "", article$article)
article$article <- gsub("\\,$", "", article$article)

article$article <- iconv(article$article, from = 'UTF-8', to = 'ASCII//TRANSLIT')
article$article <- str_remove_all(article$article, '\"')

article$article <- gsub("^\\:", "", article$article)
article$article <- gsub("\\:$", "", article$article)
article$article <- gsub("\\,\"$", "", article$article)

article$article <- gsub("\\,$", "", article$article)
article$article <- gsub("\\.$", "", article$article)
article$article <- gsub("\\.)$", "", article$article)
article$article <- gsub("\\.'$", "", article$article)
article$article <- gsub("U.S", "U.S.", article$article)
article$article <- gsub("\\;$", "", article$article)
article$article <- gsub("^\\[", "", article$article)
article$article <- gsub("\\]$", "", article$article)
article$article <- gsub("^\\(", "", article$article)
article$article <- gsub("\\)$", "", article$article)
article$article <- gsub("\\?$", "", article$article)
article$article <- gsub("^\\'", "", article$article)
article$article <- gsub("\\'$", "", article$article)
article$article <- gsub("\\-", "", article$article)

article$article <- tolower(article$article)
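
For comparison, here is a sketch of an alternative I did not use above: unnest_tokens() from tidytext lowercases the text and strips most punctuation in a single call (the object names here are my own).

# Sketch: tokenize the raw article with tidytext instead of the manual
# splitting and gsub() cleanup above
article_raw <- read_file("https://raw.githubusercontent.com/juliaDataScience-22/cuny-fall-23/manage-acquire-data/Data%20Science%20Article.txt")
article_tidy <- tibble(text = article_raw) |>
  unnest_tokens(word, text)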
  4. I created the data frame of all unique words.
# Create the data frame and add all the unique words
articleWords <- data.frame(word = c(1:length(unique(article$article))),
                           count = c(1:length(unique(article$article))))
articleWords$word <- unique(sort(article$article))
  5. I counted the number of occurrences of each word and added the counts to the data frame from Step 4. (A loop-free alternative is sketched after this chunk.)
# Count each word
num <- 1
for (myWord in articleWords$word)
{
  total <- 
    article |> 
    filter(article == myWord) |> 
    count() |> 
    as.integer()
  articleWords$count[num] <- total
  num <- num + 1
}

articleWords <-
  articleWords |>
  arrange(desc(count))
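
The same counts can be produced without the loop; a minimal sketch using dplyr’s count() on the cleaned data frame:

# Sketch: equivalent word counts without the explicit loop
articleWords_alt <- article |>
  count(article, name = "count", sort = TRUE) |>
  rename(word = article)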
  6. I created a list of top words that were not important. These words were not helpful in understanding the themes I wanted to focus on: they were not related to data science and were mostly filler words appearing between the words tied to the article’s main topic. I only removed these words if they were in the top 100 by count. After creating that list, I removed them from the data frame. (A shorter alternative using tidytext’s stop_words is sketched after this chunk.)
# Create a list of top words that were not important

toDelete <- c("the", "to", "and", "a", "of", "in", "for", "it", "that", "is", "with", "on", "will", "they", "have", "but", "are", "as", "who", "from", "their", "or", "them", "this", "one", "be", "not", "he", "it's", "at", "i", "an", "can", "some", "his", "most", "about", "all", "do", "how", "what", "you", "by", "those", "when", "also", "don't", "good", "her", "last", "make", "may", "more", "part", "person", "because", "has", "other", "up", "were", "even", "if", "many", "need", "no", "say", "says", "way", "well", "would", "any", "find", "into", "over", "same", "should", "used", "was", "we", "aren't", "both", "can't", "could", "getting", "just", "makers", "my", "new", "often", "talents", "set", "that's", "these", "they're", "use", "which", "your", "close", "free", "gap", "get", "hard", "isn't", "its", "know", "lay", "lead", "learn", "like", "managers", "might", "much", "needs", "now", "only", "out", "see", "so", "take", "than", "three", "want", "where")
articleWords <- articleWords[-(which(articleWords$word %in% toDelete)),]
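
A shorter route (a sketch, not exactly equivalent to my hand-picked list) is to drop tidytext’s built-in stop_words with an anti_join():

# Sketch: remove standard English stop words instead of the manual list
articleWords_nostop <- articleWords |>
  anti_join(stop_words, by = "word")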
  7. I made the word cloud of the top 100 words, which I consider the most relevant words since the previous step removed those that were not relevant to the article.

As you can see, some of the top words were “data”, “scientists”, “team”, “analysis”, “talent”, “communication”, “business”, “project”, and so on. These words all relate to data science, so they make complete sense based on the topic of the article. I also think it is important to note the high count of team, support, and communication. Data science is not work that is done alone. People work together to come to a final result, and this article shows that with these words. You can get a lot out of the word cloud without even reading the article, which is interesting!

# Create the word cloud

wordcloud(articleWords$word, articleWords$count, max.words = 100)

  8. Lastly, I created bar graphs of the words grouped by the categories in the sentiment lexicon called “loughran.”

It is clear that positive and negative words appeared with similar frequencies. This makes sense because the article first describes a problem and then a possible solution, so the negative and positive words are split between those two parts. The other categories were not as prevalent, except for “uncertainty,” which appeared a few times throughout the article.

Please note that the graphs below do not include every word from the article. Only words that also appear in the “loughran” lexicon were matched, so some words in the article that arguably fit these categories do not appear in the graphs.

# Show the bar graphs for the most common data science words by category

articleWordsFinal <- articleWords %>%
     inner_join(get_sentiments("loughran"))
## Joining with `by = join_by(word)`
articleWordsFinal %>%
     group_by(sentiment) %>%
     slice_max(count, n = 10) %>% 
     ungroup() %>%
     mutate(word = reorder(word, count)) %>%
     ggplot(aes(count, word, fill = sentiment)) +
     geom_col(show.legend = FALSE) +
     facet_wrap(~sentiment, scales = "free_y") +
     labs(x = "Contribution to sentiment",
          y = NULL,
          title = "Categories of Words Based on the Loughran Sentiment Lexicon")

Sources:

https://www.tidytextmining.com/sentiment.html

https://sparkbyexamples.com/r-programming/r-import-text-file-as-a-string/

https://stackoverflow.com/questions/50861626/removing-period-from-the-end-of-string

https://stackoverflow.com/questions/21781014/remove-all-line-breaks-enter-symbols-from-the-string-using-r

https://sparkbyexamples.com/r-programming/replace-empty-string-with-na-in-r-dataframe/

https://stackoverflow.com/questions/10294284/remove-all-special-characters-from-a-string-in-r

https://www.geeksforgeeks.org/convert-string-from-uppercase-to-lowercase-in-r-programming-tolower-method/

https://stackoverflow.com/questions/75084373/how-to-remove-rows-by-condition-in-r