In this chunk of code, the data frames for a few of the sentiment lexicons (AFINN, Bing, and NRC) are shown.
library(tidytext)
## Warning: package 'tidytext' was built under R version 4.3.2
get_sentiments("afinn")
## # A tibble: 2,477 × 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ℹ 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 × 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # ℹ 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,872 × 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ℹ 13,862 more rows
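The NRC lexicon assigns words to several emotion categories rather than a single score. As a quick check that is not part of the original output, you can count how many words fall into each category; this short sketch loads dplyr for count():
library(dplyr)
# Count the number of NRC words tagged with each emotion category
get_sentiments("nrc") %>%
  count(sentiment, sort = TRUE)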
Below, you can see the code used to convert the text of Jane Austen's novels into a tidy format and then track the location of each word (its line number and chapter).
library(janeaustenr)
## Warning: package 'janeaustenr' was built under R version 4.3.2
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
tidy_books <- austen_books() %>%
group_by(book) %>%
mutate(
linenumber = row_number(),
chapter = cumsum(str_detect(text,
regex("^chapter [\\divxlc]",
ignore_case = TRUE)))) %>%
ungroup() %>%
unnest_tokens(word, text)
In the code below, the words in the category “joy” are selected from the NRC lexicon and joined with the text of “Emma” to see which joyful words appear in that book and how often.
nrc_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
tidy_books %>%
filter(book == "Emma") %>%
inner_join(nrc_joy) %>%
count(word, sort = TRUE)
## Joining with `by = join_by(word)`
## # A tibble: 301 × 2
## word n
## <chr> <int>
## 1 good 359
## 2 friend 166
## 3 hope 143
## 4 happy 125
## 5 love 117
## 6 deal 92
## 7 found 92
## 8 present 89
## 9 kind 82
## 10 happiness 76
## # ℹ 291 more rows
When you analyze sentiment across a long text, it works better to count sentiment within sections of a fixed number of lines rather than across the full text at once: sections that are too small may not contain enough sentiment words, while sections that are too large can wash out the narrative structure. The code below uses sections of 80 lines each, spreads the positive and negative counts into separate columns, and computes the net sentiment (positive minus negative) for each section.
library(tidyr)
jane_austen_sentiment <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(book, index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
Below, you can see the code used to plot the net sentiment scores calculated in the previous chunk, with a separate panel for each book.
library(ggplot2)
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, ncol = 2, scales = "free_x")
In the chunk below, the book “Pride & Prejudice” was pulled out of the tidy data frame.
pride_prejudice <- tidy_books %>%
filter(book == "Pride & Prejudice")
pride_prejudice
## # A tibble: 122,204 × 4
## book linenumber chapter word
## <fct> <int> <int> <chr>
## 1 Pride & Prejudice 1 0 pride
## 2 Pride & Prejudice 1 0 and
## 3 Pride & Prejudice 1 0 prejudice
## 4 Pride & Prejudice 3 0 by
## 5 Pride & Prejudice 3 0 jane
## 6 Pride & Prejudice 3 0 austen
## 7 Pride & Prejudice 7 1 chapter
## 8 Pride & Prejudice 7 1 1
## 9 Pride & Prejudice 10 1 it
## 10 Pride & Prejudice 10 1 is
## # ℹ 122,194 more rows
In this chunk, the code finds the net sentiment in each 80-line section of the book using each of the three lexicons, so the lexicons can be compared.
afinn <- pride_prejudice %>%
inner_join(get_sentiments("afinn")) %>%
group_by(index = linenumber %/% 80) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN")
## Joining with `by = join_by(word)`
bing_and_nrc <- bind_rows(
pride_prejudice %>%
inner_join(get_sentiments("bing")) %>%
mutate(method = "Bing et al."),
pride_prejudice %>%
inner_join(get_sentiments("nrc") %>%
filter(sentiment %in% c("positive",
"negative"))
) %>%
mutate(method = "NRC")) %>%
count(method, index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("nrc") %>% filter(sentiment %in% : Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 215 of `x` matches multiple rows in `y`.
## ℹ Row 5178 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
The results from the three sentiment lexicons were plotted below for comparison.
bind_rows(afinn,
bing_and_nrc) %>%
ggplot(aes(index, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y")
The two chunks below show the number of positive and negative words in the NRC and Bing lexicons.
get_sentiments("nrc") %>%
filter(sentiment %in% c("positive", "negative")) %>%
count(sentiment)
## # A tibble: 2 × 2
## sentiment n
## <chr> <int>
## 1 negative 3316
## 2 positive 2308
get_sentiments("bing") %>%
count(sentiment)
## # A tibble: 2 × 2
## sentiment n
## <chr> <int>
## 1 negative 4781
## 2 positive 2005
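These counts show that the Bing lexicon has a higher share of negative words than NRC, which can contribute to differences in the net sentiment each lexicon produces. A small sketch, not in the original, that computes the negative-to-positive ratio directly:
# Negative-to-positive word ratio for each lexicon
bind_rows(
  get_sentiments("nrc") %>%
    filter(sentiment %in% c("positive", "negative")) %>%
    mutate(lexicon = "NRC"),
  get_sentiments("bing") %>%
    mutate(lexicon = "Bing")
) %>%
  count(lexicon, sentiment) %>%
  group_by(lexicon) %>%
  summarise(neg_to_pos = n[sentiment == "negative"] / n[sentiment == "positive"])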
In this chunk of code, you can see how to count how much each word contributes to positive and negative sentiment across all of the novels.
bing_word_counts <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
bing_word_counts
## # A tibble: 2,585 × 3
## word sentiment n
## <chr> <chr> <int>
## 1 miss negative 1855
## 2 well positive 1523
## 3 good positive 1380
## 4 great positive 981
## 5 like positive 725
## 6 better positive 639
## 7 enough positive 613
## 8 happy positive 534
## 9 love positive 495
## 10 pleasure positive 462
## # ℹ 2,575 more rows
Use the code below to plot the top ten positive and negative words by their contribution to sentiment.
bing_word_counts %>%
group_by(sentiment) %>%
slice_max(n, n = 10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(x = "Contribution to sentiment",
y = NULL)
Some words are assigned a positive or negative sentiment that does not fit their use in context (in these novels, “miss” is usually a title rather than a negative word). The code below shows how to add such a word to a custom stop-words list.
custom_stop_words <- bind_rows(tibble(word = c("miss"),
lexicon = c("custom")),
stop_words)
custom_stop_words
## # A tibble: 1,150 × 2
## word lexicon
## <chr> <chr>
## 1 miss custom
## 2 a SMART
## 3 a's SMART
## 4 able SMART
## 5 about SMART
## 6 above SMART
## 7 according SMART
## 8 accordingly SMART
## 9 across SMART
## 10 actually SMART
## # ℹ 1,140 more rows
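A minimal sketch of how the custom list might be applied; this step is an assumption and not shown in the original. Removing the custom entries with anti_join() before recounting keeps “miss” from dominating the negative words:
# Drop only the custom stop words (here just "miss") before recounting
tidy_books %>%
  anti_join(custom_stop_words %>% filter(lexicon == "custom"), by = "word") %>%
  inner_join(get_sentiments("bing"), by = "word",
             relationship = "many-to-many") %>%
  count(word, sentiment, sort = TRUE)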
The wordcloud package can be used to create a word cloud of the most common words (after removing stop words).
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 4.3.2
## Loading required package: RColorBrewer
tidy_books %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
## Joining with `by = join_by(word)`
Using acast() from the reshape2 package, the sentiment counts can be reshaped into a matrix so that comparison.cloud() can build a single word cloud comparing positive and negative words.
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("gray20", "gray80"),
max.words = 100)
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
This code shows how to tokenize the text into whole sentences instead of individual words.
p_and_p_sentences <- tibble(text = prideprejudice) %>%
unnest_tokens(sentence, text, token = "sentences")
p_and_p_sentences$sentence[2]
## [1] "by jane austen"
In the chunk of code below, you can see how the book texts were split into chapters with a regex tokenizer and how many chapters each book contains.
austen_chapters <- austen_books() %>%
group_by(book) %>%
unnest_tokens(chapter, text, token = "regex",
pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
ungroup()
austen_chapters %>%
group_by(book) %>%
summarise(chapters = n())
## # A tibble: 6 × 2
## book chapters
## <fct> <int>
## 1 Sense & Sensibility 51
## 2 Pride & Prejudice 62
## 3 Mansfield Park 49
## 4 Emma 56
## 5 Northanger Abbey 32
## 6 Persuasion 25
You can use the code below to pull the negative words from the Bing lexicon, count the total words in each chapter, and then find, for each novel, the chapter with the highest ratio of negative words to total words.
bingnegative <- get_sentiments("bing") %>%
filter(sentiment == "negative")
wordcounts <- tidy_books %>%
group_by(book, chapter) %>%
summarize(words = n())
## `summarise()` has grouped output by 'book'. You can override using the
## `.groups` argument.
tidy_books %>%
semi_join(bingnegative) %>%
group_by(book, chapter) %>%
summarize(negativewords = n()) %>%
left_join(wordcounts, by = c("book", "chapter")) %>%
mutate(ratio = negativewords/words) %>%
filter(chapter != 0) %>%
slice_max(ratio, n = 1) %>%
ungroup()
## Joining with `by = join_by(word)`
## `summarise()` has grouped output by 'book'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 5
## book chapter negativewords words ratio
## <fct> <int> <int> <int> <dbl>
## 1 Sense & Sensibility 43 161 3405 0.0473
## 2 Pride & Prejudice 34 111 2104 0.0528
## 3 Mansfield Park 46 173 3685 0.0469
## 4 Emma 15 151 3340 0.0452
## 5 Northanger Abbey 21 149 2982 0.0500
## 6 Persuasion 4 62 1807 0.0343
The article, “Data Science and the Art of Persuasion” by Scott Berinato, discusses how organizations struggle to communicate insights from the data they collect, explains why this happens, and describes how it can be fixed.
Steps:
# Load libraries
library(readr)
library(tidyr)
library(stringr)
library(dplyr)
library(wordcloud)
library(tidytext)
library(ggplot2)
# Import the article as text and convert to data frame
article <- read_file("https://raw.githubusercontent.com/juliaDataScience-22/cuny-fall-23/manage-acquire-data/Data%20Science%20Article.txt")
article <- as.data.frame(article)
# Tidy the data and remove unwanted characters
article <- separate_longer_delim(article, article, delim = "\r\n")
article <- separate_longer_delim(article, article, delim = " ")
article <- separate_longer_delim(article, article, delim = "/")
article <- separate_longer_delim(article, article, delim = "-")
article <- separate_longer_delim(article, article, delim = "—")
article <- separate_longer_delim(article, article, delim = "….")
article[article == ''] <- NA
article <-
article |>
na.omit()
article$article <- gsub("\\.$", "", article$article)
article$article <- gsub("\\,$", "", article$article)
article$article <- iconv(article$article, from = 'UTF-8', to = 'ASCII//TRANSLIT')
article$article <- str_remove_all(article$article, '\"')
article$article <- gsub("^\\:", "", article$article)
article$article <- gsub("\\:$", "", article$article)
article$article <- gsub("\\,\"$", "", article$article)
article$article <- gsub("\\,$", "", article$article)
article$article <- gsub("\\.$", "", article$article)
article$article <- gsub("\\.)$", "", article$article)
article$article <- gsub("\\.'$", "", article$article)
article$article <- gsub("U.S", "U.S.", article$article)
article$article <- gsub("\\;$", "", article$article)
article$article <- gsub("^\\[", "", article$article)
article$article <- gsub("\\]$", "", article$article)
article$article <- gsub("^\\(", "", article$article)
article$article <- gsub("\\)$", "", article$article)
article$article <- gsub("\\?$", "", article$article)
article$article <- gsub("^\\'", "", article$article)
article$article <- gsub("\\'$", "", article$article)
article$article <- gsub("\\-", "", article$article)
article$article <- tolower(article$article)
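# Note (hedged alternative, not part of the original pipeline): the gsub()
# calls above strip punctuation one pattern at a time. Leading and trailing
# punctuation could instead be removed in a single pass, although this is not
# equivalent in every edge case (for example, it would also strip the period
# restored to "U.S."):
# article$article <- gsub("^[[:punct:]]+|[[:punct:]]+$", "", article$article)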
# Create the data frame and add all the unique words
articleWords <- data.frame(word = c(1:length(unique(article$article))),
count = c(1:length(unique(article$article))))
articleWords$word <- unique(sort(article$article))
# Count each word
num <- 1
for (myWord in articleWords$word)
{
total <-
article |>
filter(article == myWord) |>
count() |>
as.integer()
articleWords$count[num] <- total
num <- num + 1
}
articleWords <-
articleWords |>
arrange(desc(count))
# Create a list of top words that were not important
toDelete <- c("the", "to", "and", "a", "of", "in", "for", "it", "that", "is", "with", "on", "will", "they", "have", "but", "are", "as", "who", "from", "their", "or", "them", "this", "one", "be", "not", "he", "it's", "at", "i", "an", "can", "some", "his", "most", "about", "all", "do", "how", "what", "you", "by", "those", "when", "also", "don't", "good", "her", "last", "make", "may", "more", "part", "person", "because", "has", "other", "up", "were", "even", "if", "many", "need", "no", "say", "says", "way", "well", "would", "any", "find", "into", "over", "same", "should", "used", "was", "we", "aren't", "both", "can't", "could", "getting", "just", "makers", "my", "new", "often", "talents", "set", "that's", "these", "they're", "use", "which", "your", "close", "free", "gap", "get", "hard", "isn't", "its", "know", "lay", "lead", "learn", "like", "managers", "might", "much", "needs", "now", "only", "out", "see", "so", "take", "than", "three", "want", "where")
articleWords <- articleWords[-(which(articleWords$word %in% toDelete)),]
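The for loop above works, but the same word counts can be produced more idiomatically with dplyr. A minimal sketch, using a hypothetical articleWords2 data frame and the same article and toDelete objects:
# Idiomatic alternative to the loop: count each word and drop unwanted ones
articleWords2 <- article |>
  count(article, name = "count", sort = TRUE) |>
  rename(word = article) |>
  filter(!word %in% toDelete)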
As you can see, some of the top words were “data”, “scientists”, “team”, “analysis”, “talent”, “communication”, “business”, “project”, and so on. These words all relate to data science, so they make complete sense based on the topic of the article. I also think it is important to note the high count of team, support, and communication. Data science is not work that is done alone. People work together to come to a final result, and this article shows that with these words. You can get a lot out of the word cloud without even reading the article, which is interesting!
# Create the word cloud
wordcloud(articleWords$word, articleWords$count, max.words = 100)
It is clear that both positive and negative words appeared with similar frequencies. This makes sense because the article first describes a problem and then describes a possible solution, so the negative and positive words are split between those two parts. The other categories were not as prevalent in the article, except for “uncertainty,” which appeared a few times throughout.
Please note that in the graphs below, not all words from the article are included. Only the words that also appear in the Loughran sentiment lexicon were pulled from the article, so some words that arguably fit these categories do not show up in the graphs.
# Show the bar graphs for the most common data science words by category
articleWordsFinal <- articleWords %>%
inner_join(get_sentiments("loughran"))
## Joining with `by = join_by(word)`
articleWordsFinal %>%
group_by(sentiment) %>%
slice_max(count, n = 10) %>%
ungroup() %>%
mutate(word = reorder(word, count)) %>%
ggplot(aes(count, word, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(x = "Contribution to sentiment",
y = NULL,
title = "Categories of Words Based on the Loughran Sentiment Lexicon")
https://www.tidytextmining.com/sentiment.html
https://sparkbyexamples.com/r-programming/r-import-text-file-as-a-string/
https://stackoverflow.com/questions/50861626/removing-period-from-the-end-of-string
https://sparkbyexamples.com/r-programming/replace-empty-string-with-na-in-r-dataframe/
https://stackoverflow.com/questions/10294284/remove-all-special-characters-from-a-string-in-r
https://stackoverflow.com/questions/75084373/how-to-remove-rows-by-condition-in-r