For this assignment, we explore and build on the code presented in Chapter 2 of the web textbook Text Mining with R. The first part of this assignment is taken directly from the book's example code. From there, we work with a different corpus of our choosing and incorporate at least one additional sentiment lexicon discovered through research (potentially from another R package).
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(tidytext)
## Warning: package 'tidytext' was built under R version 4.2.3
library(textdata)
## Warning: package 'textdata' was built under R version 4.2.3
library(janeaustenr)
## Warning: package 'janeaustenr' was built under R version 4.2.3
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 4.2.3
## Loading required package: RColorBrewer
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.2.3
##
## Attaching package: 'reshape2'
##
## The following object is masked from 'package:tidyr':
##
## smiths
library(gutenbergr)
## Warning: package 'gutenbergr' was built under R version 4.2.3
library(openintro)
## Warning: package 'openintro' was built under R version 4.2.3
## Loading required package: airports
## Warning: package 'airports' was built under R version 4.2.3
## Loading required package: cherryblossom
## Warning: package 'cherryblossom' was built under R version 4.2.3
## Loading required package: usdata
## Warning: package 'usdata' was built under R version 4.2.3
##
## Attaching package: 'openintro'
##
## The following object is masked from 'package:reshape2':
##
## tips
Obtain sentiment lexicons from three different sources: AFINN, Bing, and NRC.
Note: if you initially encounter problems loading AFINN, Bing, or NRC, you will need to accept the lexicon's license by running the corresponding get_sentiments() call in the R console and agreeing to the download prompt.
afinn <- get_sentiments("afinn")
bing <- get_sentiments("bing")
nrc <- get_sentiments("nrc")
In the code below, we use the austen_books() function from the janeaustenr package to extract the text of Jane Austen's novels and prepare it for analysis by splitting it into individual words with the unnest_tokens() function.
tidy_books <- austen_books() %>%
group_by(book) %>%
mutate(
linenumber = row_number(),
chapter = cumsum(str_detect(text,
regex("^chapter [\\divxlc]",
ignore_case = TRUE)))) %>%
ungroup() %>%
unnest_tokens(word, text)
Next, we filter the NRC sentiment lexicon to include only words with
a “joy” sentiment, then use the inner_join() function to
merge this lexicon with the tidy text data frame. The resulting data
frame is then filtered to include only words from “Emma” and is counted
using count() to show the frequency of words with a “joy”
sentiment.
nrc_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
tidy_books %>%
filter(book == "Emma") %>%
inner_join(nrc_joy) %>%
count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 301 × 2
## word n
## <chr> <int>
## 1 good 359
## 2 friend 166
## 3 hope 143
## 4 happy 125
## 5 love 117
## 6 deal 92
## 7 found 92
## 8 present 89
## 9 kind 82
## 10 happiness 76
## # … with 291 more rows
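As a small extension of ours (not in the book's example), the same join generalizes to all six novels at once if we keep the book column in the count:
# joy-word frequencies across every Austen novel, not just Emma
tidy_books %>%
  inner_join(nrc_joy, by = "word") %>%
  count(book, word, sort = TRUE)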
Now, we join the tidy text data frame with the Bing sentiment lexicon
using inner_join(). We then use the count() and
pivot_wider() functions to count the number of positive and
negative words in each book, grouped by sections of 80 lines. Finally,
the ggplot() function is used to create bar charts that
show the sentiment score over the plot trajectory of each novel. The
chart is facet-wrapped by book, and the sentiment score is calculated as
the difference between the number of positive and negative words.
jane_austen_sentiment <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(book, index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining, by = "word"
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, ncol = 2, scales = "free_x")
Next, we compare the three lexicons on a single novel, Pride & Prejudice.
pride_prejudice <- tidy_books %>%
filter(book == "Pride & Prejudice")
afinn <- pride_prejudice %>%
inner_join(get_sentiments("afinn")) %>%
group_by(index = linenumber %/% 80) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN")
## Joining, by = "word"
bing_and_nrc <- bind_rows(
pride_prejudice %>%
inner_join(get_sentiments("bing")) %>%
mutate(method = "Bing et al."),
pride_prejudice %>%
inner_join(get_sentiments("nrc") %>%
filter(sentiment %in% c("positive",
"negative"))
) %>%
mutate(method = "NRC")) %>%
count(method, index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining, by = "word"
## Joining, by = "word"
bind_rows(afinn,
bing_and_nrc) %>%
ggplot(aes(index, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y")
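As a brief extension of ours (not in the book): since all three methods score the same 80-line blocks, we can quantify how closely they agree by correlating the per-block sentiment scores. Note that afinn here refers to the per-block score data frame built above, not the raw lexicon.
bind_rows(afinn, bing_and_nrc) %>%
  select(method, index, sentiment) %>%
  # one column of block scores per method
  pivot_wider(names_from = method, values_from = sentiment) %>%
  select(-index) %>%
  # pairwise correlations between the three methods' scores
  cor(use = "pairwise.complete.obs")
Next, we count the frequency of words in the Austen text, categorized by sentiment, using the Bing lexicon.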
bing_word_counts <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
## Joining, by = "word"
Next, we visualize the top 10 positive and negative words from the Bing lexicon in a bar plot.
bing_word_counts %>%
group_by(sentiment) %>%
slice_max(n, n = 10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(x = "Contribution to sentiment",
y = NULL)
Next, we create a custom list of stop words that includes "well", "", and "miss" by binding a tibble of these words to the standard stop_words list.
custom_stop_words <- bind_rows(tibble(word = c("well", "", "miss"),
lexicon = c("custom")),
stop_words)
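Although we define custom_stop_words here, it is worth seeing its effect: in Austen, "miss" is usually a title (as in "Miss Woodhouse") rather than a negative word, so the Bing lexicon over-counts it. A hedged usage sketch of ours:
bing_word_counts %>%
  # drop "well", "", and "miss" before ranking contributors
  anti_join(custom_stop_words, by = "word") %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup()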
My Bondage and My Freedom by Frederick Douglass
We will analyze My Bondage and My Freedom, an autobiographical slave narrative by Frederick Douglass. We looked up the book's ID number on Project Gutenberg and use the gutenbergr package to search for and download the text.
bondage_count <- gutenberg_download(202)
## Determining mirror for Project Gutenberg from https://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
bondage_count
## # A tibble: 12,324 × 2
## gutenberg_id text
## <int> <chr>
## 1 202 "MY BONDAGE and MY FREEDOM"
## 2 202 ""
## 3 202 "By Frederick Douglass"
## 4 202 ""
## 5 202 ""
## 6 202 "By a principle essential to Christianity, a PERSON is eternall…
## 7 202 "differenced from a THING; so that the idea of a HUMAN BEING,"
## 8 202 "necessarily excludes the idea of PROPERTY IN THAT BEING. —COLE…
## 9 202 ""
## 10 202 "Entered according to Act of Congress in 1855 by Frederick Doug…
## # … with 12,314 more rows
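Rather than hard-coding the cutoff row below, it could be located programmatically. A sketch of ours: list the lines that begin with the first chapter heading and pick the occurrence past the table of contents.
# row numbers of lines starting with "CHAPTER I" (heading and TOC entry)
which(str_detect(bondage_count$text, regex("^CHAPTER I\\b")))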
# remove the first 762 rows of text, which are front matter and the table of contents
bondage_count <- bondage_count[763:nrow(bondage_count), ]
# use unnest_tokens() to break each line into individual words, one word per row
bondage <- bondage_count %>% unnest_tokens(word, text)
bondage
## # A tibble: 129,096 × 2
## gutenberg_id word
## <int> <chr>
## 1 202 chapter
## 2 202 i
## 3 202 _childhood_
## 4 202 place
## 5 202 of
## 6 202 birth
## 7 202 character
## 8 202 of
## 9 202 the
## 10 202 district
## # … with 129,086 more rows
bondage_index <- bondage_count %>%
filter(text != "") %>%
mutate(linenumber = row_number(),
chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", ignore_case = TRUE))))
bondage_index
## # A tibble: 10,716 × 4
## gutenberg_id text linen…¹ chapter
## <int> <chr> <int> <int>
## 1 202 CHAPTER I. _Childhood_ 1 1
## 2 202 PLACE OF BIRTH—CHARACTER OF THE DISTRICT—TUCKAH… 2 1
## 3 202 NAME—CHOPTANK RIVER—TIME OF BIRTH—GENEALOGICAL … 3 1
## 4 202 TIME—NAMES OF GRANDPARENTS—THEIR POSITION—GRAND… 4 1
## 5 202 ESTEEMED—“BORN TO GOOD LUCK”—SWEET POTATOES—SUP… 5 1
## 6 202 CABIN—ITS CHARMS—SEPARATING CHILDREN—MY AUNTS—T… 6 1
## 7 202 KNOWLEDGE OF BEING A SLAVE—OLD MASTER—GRIEFS AN… 7 1
## 8 202 CHILDHOOD—COMPARATIVE HAPPINESS OF THE SLAVE-BO… 8 1
## 9 202 SLAVEHOLDER. 9 1
## 10 202 In Talbot county, Eastern Shore, Maryland, near… 10 1
## # … with 10,706 more rows, and abbreviated variable name ¹linenumber
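As a quick sanity check of ours on the chapter regex, we can count the lines detected per chapter; the number of groups should match the book's chapter headings.
# lines per detected chapter
bondage_index %>%
  count(chapter)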
bondage %>%
inner_join(get_sentiments("bing")) %>%
filter(sentiment == "positive") %>%
count(word, sentiment, sort = TRUE) %>%
top_n(10) %>%
mutate(word = reorder(word, desc(n))) %>%
ggplot() +
aes(x = word, y = n) +
labs(title = "Most Frequent Positive Words") +
ylab("Count") +
xlab("Word") +
geom_col() +
geom_text(aes(label = n, vjust = -.5)) +
theme(
panel.background = element_rect(fill = "white", color = NA),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.title = element_text(hjust = 0.5)
)
## Joining, by = "word"
## Selecting by n
bondage %>%
inner_join(get_sentiments("bing")) %>%
filter(sentiment == "negative") %>%
count(word, sentiment, sort = TRUE) %>%
top_n(10) %>%
mutate(word = reorder(word, desc(n))) %>%
ggplot() +
aes(x = word, y = n) +
labs(title = "Most Frequent Negative Words") +
ylab("Count") +
xlab("Word") +
geom_col() +
geom_text(aes(label = n, vjust = -.5)) +
theme(
panel.background = element_rect(fill = "white", color = NA),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.title = element_text(hjust = 0.5)
)
## Joining, by = "word"
## Selecting by n
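The positive and negative plots above differ only in the sentiment filter and the title, so a small helper function (a refactoring sketch of ours, not part of the assignment code) removes the duplication:
plot_top_words <- function(tokens, sent, title) {
  tokens %>%
    inner_join(get_sentiments("bing"), by = "word") %>%
    filter(sentiment == sent) %>%
    count(word, sort = TRUE) %>%
    slice_max(n, n = 10) %>%
    mutate(word = reorder(word, desc(n))) %>%
    ggplot(aes(x = word, y = n)) +
    geom_col() +
    geom_text(aes(label = n, vjust = -.5)) +
    labs(title = title, x = "Word", y = "Count") +
    theme(
      panel.background = element_rect(fill = "white", color = NA),
      axis.text.y = element_blank(),
      axis.ticks.y = element_blank(),
      plot.title = element_text(hjust = 0.5)
    )
}
# should reproduce the positive-word plot above
plot_top_words(bondage, "positive", "Most Frequent Positive Words")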
Let’s look at the most common words in Frederick Douglass’s book with a wordcloud.
library(RColorBrewer)
# Color palette for the wordclouds
colors <- brewer.pal(8, "Dark2")
# Wordcloud of non-stopwords
bondage %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100, colors = colors))
## Joining, by = "word"
Above are the most common words in Frederick Douglass’s autobiographical slave narrative.
# Sentiment analysis to tag positive and negative words using an inner join, then find the most common positive and negative words
bondage %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = colors,
max.words = 100)
## Joining, by = "word"
The size of a word’s text above is in proportion to its frequency within its sentiment. We can use this visualization to see the most important positive and negative words, but the sizes of the words are not comparable across sentiments.
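One way to make the two halves comparable (a hedged sketch of ours): scale each word's count by its sentiment's total, so sizes reflect within-sentiment proportions.
bondage %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%
  # each word's share of its sentiment's total occurrences
  mutate(prop = n / sum(n)) %>%
  slice_max(prop, n = 5)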
We will now use the Loughran-McDonald ("loughran") lexicon, which we found through our research.
Note: if you initially encounter problems loading loughran, you will need to accept the lexicon's license by running get_sentiments("loughran") in the R console and agreeing to the download prompt.
lghrn <- get_sentiments("loughran")
unique(lghrn$sentiment)
## [1] "negative" "positive" "uncertainty" "litigious" "constraining"
## [6] "superfluous"
# Let's explore the lexicon to see which words are tagged as litigious and constraining.
bondage_index %>%
unnest_tokens(word, text) %>%
inner_join(get_sentiments("loughran")) %>%
filter(sentiment %in% c("litigious", "constraining")) %>%
count(word, sentiment, sort = TRUE) %>%
group_by(sentiment) %>%
top_n(10) %>%
ggplot() +
aes(x = reorder(word,desc(n)), y = n) +
geom_col() +
facet_grid(~sentiment, scales = "free_x") +
geom_text(aes(label = n, vjust = -.5)) +
labs(title = "Words Associated with Litigious and Constraining") +
ylab("Count") +
xlab("Word") +
theme(
panel.background = element_rect(fill = "white", color = NA),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.title = element_text(hjust = 0.5)
)
## Joining, by = "word"
## Selecting by n
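To back up the comparison drawn in the conclusion below, a quick tally of ours of the total word occurrences per tag:
# total litigious vs. constraining word occurrences
bondage_index %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("loughran"), by = "word") %>%
  filter(sentiment %in% c("litigious", "constraining")) %>%
  count(sentiment)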
This assignment has allowed us to explore the topic of sentiment analysis. We implemented and expanded upon the main example code from Chapter 2 of Text Mining with R, using three sentiment lexicons (AFINN, Bing, and NRC) to analyze the sentiment of Jane Austen's novels. Then, using the gutenbergr package, we explored My Bondage and My Freedom by Frederick Douglass. We tidied the text into a one-token-per-row format with the unnest_tokens() function and used sentiment analysis with an inner join to find the most frequent positive and negative words. From our findings, the most frequent positive and negative words are "master" and "slave", respectively; both also appear among the most common words overall in the wordcloud.

Finally, we filtered the loughran sentiment lexicon to the "litigious" and "constraining" sentiments, joined it with the tokenized text of My Bondage and My Freedom, and counted word frequencies with count(). From here we can see that more words in the book are associated with the litigious sentiment than with the constraining one.