Introduction
For Assignment #10, I build on the sentiment analysis example from the Text Mining with R reading. I then extend the analysis to a new corpus, Grimm’s Fairy Tales, and introduce an additional sentiment lexicon, VADER. The first step is to load the AFINN, Bing, and NRC lexicons with get_sentiments().
knitr::opts_chunk$set(echo = TRUE)
library(tidytext)
get_sentiments("afinn")
## # A tibble: 2,477 × 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ℹ 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 × 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # ℹ 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,872 × 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ℹ 13,862 more rows
The example uses Jane Austen’s novels from the janeaustenr package. The text is tidied by grouping it by book, numbering each line, and tracking chapter breaks with a regular expression that matches the chapter headings. Finally, unnest_tokens() splits the text into one word per row, making it ready for text analysis.
library(janeaustenr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
tidy_books <- austen_books() %>%
group_by(book) %>%
mutate(
linenumber = row_number(),
chapter = cumsum(str_detect(text,
regex("^chapter [\\divxlc]",
ignore_case = TRUE)))) %>%
ungroup() %>%
unnest_tokens(word, text)
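The chapter column depends on a running count of heading matches. Here is a minimal sketch of that idea on a made-up vector of lines; the `lines` object is hypothetical and only stands in for one book's text column.

```r
library(stringr)

# Hypothetical lines standing in for the text column of one book
lines <- c("EMMA", "", "CHAPTER I",
           "Emma Woodhouse, handsome, clever, and rich,",
           "seemed to unite some of the best blessings",
           "CHAPTER II",
           "Mr. Weston was a native of Highbury")

# TRUE wherever a line starts with "chapter" followed by a digit
# or a Roman-numeral character (case-insensitive)
is_heading <- str_detect(lines, regex("^chapter [\\divxlc]", ignore_case = TRUE))

# The running total goes up by one at every heading, so each line
# carries the number of the chapter it belongs to (0 = front matter)
chapter <- cumsum(is_heading)
chapter
```

Lines before the first heading get chapter 0, which is why chapter 0 is filtered out later when chapters are compared.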
Next, the NRC lexicon is filtered for its joy words, which are then joined to the words of Emma and counted.
nrc_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
tidy_books %>%
filter(book == "Emma") %>%
inner_join(nrc_joy) %>%
count(word, sort = TRUE)
## Joining with `by = join_by(word)`
## # A tibble: 301 × 2
## word n
## <chr> <int>
## 1 good 359
## 2 friend 166
## 3 hope 143
## 4 happy 125
## 5 love 117
## 6 deal 92
## 7 found 92
## 8 present 89
## 9 kind 82
## 10 happiness 76
## # ℹ 291 more rows
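To track sentiment across each novel, inner_join() matches the words to the Bing lexicon, the matches are counted in 80-line sections, and net sentiment is computed as the positive count minus the negative count.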
library(tidyr)
jane_austen_sentiment <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(book, index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
library(ggplot2)
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, ncol = 2, scales = "free_x")
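The sectioning step above can be easier to follow on a small scale. This is a sketch with invented data: `toy_matches` is a hypothetical table of words that already matched a lexicon, not output from the Austen analysis.

```r
library(dplyr)
library(tidyr)

# Hypothetical matched words: one row per word found in the lexicon
toy_matches <- tibble(
  linenumber = c(3, 15, 79, 80, 95, 159, 160),
  sentiment  = c("positive", "negative", "positive", "negative",
                 "negative", "positive", "positive")
)

toy_sections <- toy_matches %>%
  # Integer division: lines 0-79 fall in section 0, 80-159 in section 1, ...
  count(index = linenumber %/% 80, sentiment) %>%
  # One column per sentiment label; sections with no matches get 0
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  # Net sentiment for each section
  mutate(sentiment = positive - negative)

toy_sections
```

The section size of 80 lines is a tuning choice: smaller sections give noisier estimates, while larger ones can smooth over changes in the narrative.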
pride_prejudice <- tidy_books %>%
filter(book == "Pride & Prejudice")
pride_prejudice
## # A tibble: 122,204 × 4
## book linenumber chapter word
## <fct> <int> <int> <chr>
## 1 Pride & Prejudice 1 0 pride
## 2 Pride & Prejudice 1 0 and
## 3 Pride & Prejudice 1 0 prejudice
## 4 Pride & Prejudice 3 0 by
## 5 Pride & Prejudice 3 0 jane
## 6 Pride & Prejudice 3 0 austen
## 7 Pride & Prejudice 7 1 chapter
## 8 Pride & Prejudice 7 1 1
## 9 Pride & Prejudice 10 1 it
## 10 Pride & Prejudice 10 1 is
## # ℹ 122,194 more rows
The three lexicons are compared on Pride & Prejudice, which was filtered out of tidy_books above. AFINN assigns each word a numeric score, so its scores are summed within each 80-line section. Bing and NRC label words as positive or negative, so their matches are counted and net sentiment is computed as positive minus negative.
afinn <- pride_prejudice %>%
inner_join(get_sentiments("afinn")) %>%
group_by(index = linenumber %/% 80) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN")
## Joining with `by = join_by(word)`
bing_and_nrc <- bind_rows(pride_prejudice %>%
inner_join(get_sentiments("bing")) %>%
mutate(method = "Bing et al."),
pride_prejudice %>%
inner_join(get_sentiments("nrc") %>%
filter(sentiment %in% c("positive",
"negative"))) %>%
mutate(method = "NRC")) %>%
count(method, index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("nrc") %>% filter(sentiment %in% : Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 215 of `x` matches multiple rows in `y`.
## ℹ Row 5178 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
bind_rows(afinn,
bing_and_nrc) %>%
ggplot(aes(index, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y")
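The difference between summing numeric scores and counting labels can be shown with a few words. This is a sketch only: `toy_afinn` and `toy_bing` are hypothetical mini-lexicons invented for illustration, and their entries are not claimed to match the real lists.

```r
library(dplyr)

# Hypothetical mini-lexicons; the real AFINN and Bing lists are far larger
toy_afinn <- tibble(word = c("good", "happy", "abandon"),
                    value = c(3, 3, -2))
toy_bing  <- tibble(word = c("good", "happy", "abandon"),
                    sentiment = c("positive", "positive", "negative"))

# Hypothetical section of text, already split into words
toy_words <- tibble(word = c("good", "abandon", "abandon", "happy", "the"))

# Numeric lexicon: add up the signed scores
afinn_score <- toy_words %>%
  inner_join(toy_afinn, by = "word") %>%
  summarise(sentiment = sum(value)) %>%
  pull(sentiment)

# Categorical lexicon: count the labels, then subtract
bing_counts <- toy_words %>%
  inner_join(toy_bing, by = "word") %>%
  count(sentiment)
bing_score <- with(bing_counts,
                   n[sentiment == "positive"] - n[sentiment == "negative"])

c(AFINN = afinn_score, Bing = bing_score)
```

Because the numeric lexicon weights words by intensity, the same text can come out positive under one method and neutral under the other, which is one reason the three panels differ in absolute values.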
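The counts of positive and negative words in each lexicon help explain the differences. Both lexicons contain more negative than positive words, and the imbalance is larger in Bing (4,781 to 2,005) than in NRC (3,316 to 2,308).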
get_sentiments("nrc") %>%
filter(sentiment %in% c("positive", "negative")) %>%
count(sentiment)
## # A tibble: 2 × 2
## sentiment n
## <chr> <int>
## 1 negative 3316
## 2 positive 2308
get_sentiments("bing") %>%
count(sentiment)
## # A tibble: 2 × 2
## sentiment n
## <chr> <int>
## 1 negative 4781
## 2 positive 2005
Most Common Positive and Negative Words
Counting each word together with its sentiment shows which words contribute most to each category. The counts are charted below and then visualized with word clouds in the following chunks.
bing_word_counts <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
bing_word_counts
## # A tibble: 2,585 × 3
## word sentiment n
## <chr> <chr> <int>
## 1 miss negative 1855
## 2 well positive 1523
## 3 good positive 1380
## 4 great positive 981
## 5 like positive 725
## 6 better positive 639
## 7 enough positive 613
## 8 happy positive 534
## 9 love positive 495
## 10 pleasure positive 462
## # ℹ 2,575 more rows
bing_word_counts %>%
group_by(sentiment) %>%
slice_max(n, n = 10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(x = "Contribution to sentiment",
y = NULL)
custom_stop_words <- bind_rows(tibble(word = c("miss"),
lexicon = c("custom")),
stop_words)
custom_stop_words
## # A tibble: 1,150 × 2
## word lexicon
## <chr> <chr>
## 1 miss custom
## 2 a SMART
## 3 a's SMART
## 4 able SMART
## 5 about SMART
## 6 above SMART
## 7 according SMART
## 8 accordingly SMART
## 9 across SMART
## 10 actually SMART
## # ℹ 1,140 more rows
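The custom list is built above but is not applied in the chunks shown here. Below is a hedged sketch of how such a list could be used with anti_join(); `toy_counts` and `toy_stop_words` are hypothetical stand-ins for bing_word_counts and custom_stop_words.

```r
library(dplyr)

# Hypothetical word counts standing in for bing_word_counts
toy_counts <- tibble(
  word      = c("miss", "well", "good"),
  sentiment = c("negative", "positive", "positive"),
  n         = c(1855, 1523, 1380)
)

# Same shape as custom_stop_words, reduced to a single custom entry
toy_stop_words <- tibble(word = "miss", lexicon = "custom")

# anti_join() keeps only the rows of toy_counts with no match in toy_stop_words
filtered <- toy_counts %>%
  anti_join(toy_stop_words, by = "word")

filtered
```

In these novels "miss" is mostly a title for young women rather than a negative word, so removing it keeps one frequent word from dominating the negative side.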
Word Cloud
#install.packages("wordcloud")
library(wordcloud)
## Loading required package: RColorBrewer
tidy_books %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
## Joining with `by = join_by(word)`
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("gray20", "gray80"),
max.words = 100)
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
Looking at Units beyond Just Words
p_and_p_sentences <- tibble(text = prideprejudice) %>%
unnest_tokens(sentence, text, token = "sentences")
p_and_p_sentences$sentence[2]
## [1] "by jane austen"
austen_chapters <- austen_books() %>%
group_by(book) %>%
unnest_tokens(chapter, text, token = "regex",
pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
ungroup()
austen_chapters %>%
group_by(book) %>%
summarise(chapters = n())
## # A tibble: 6 × 2
## book chapters
## <fct> <int>
## 1 Sense & Sensibility 51
## 2 Pride & Prejudice 62
## 3 Mansfield Park 49
## 4 Emma 56
## 5 Northanger Abbey 32
## 6 Persuasion 25
bingnegative <- get_sentiments("bing") %>%
filter(sentiment == "negative")
wordcounts <- tidy_books %>%
group_by(book, chapter) %>%
summarize(words = n())
## `summarise()` has grouped output by 'book'. You can override using the
## `.groups` argument.
tidy_books %>%
semi_join(bingnegative) %>%
group_by(book, chapter) %>%
summarize(negativewords = n()) %>%
left_join(wordcounts, by = c("book", "chapter")) %>%
mutate(ratio = negativewords/words) %>%
filter(chapter != 0) %>%
slice_max(ratio, n = 1) %>%
ungroup()
## Joining with `by = join_by(word)`
## `summarise()` has grouped output by 'book'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 5
## book chapter negativewords words ratio
## <fct> <int> <int> <int> <dbl>
## 1 Sense & Sensibility 43 161 3405 0.0473
## 2 Pride & Prejudice 34 111 2104 0.0528
## 3 Mansfield Park 46 173 3685 0.0469
## 4 Emma 15 151 3340 0.0452
## 5 Northanger Abbey 21 149 2982 0.0500
## 6 Persuasion 4 62 1807 0.0343
VADER Lexicon
As an additional layer, I include the VADER lexicon. Although VADER is typically used for social media analysis, I thought it would suit Grimm’s Fairy Tales because its ability to capture nuanced sentiment can reveal the emotional tone underlying these classic stories. Given the complex characters and moral dilemmas in the tales, VADER’s separate positive, negative, and neutral scores should give a fuller picture of the narratives’ emotional landscape.
While trying to run VADER, I encountered several difficulties that slowed down the process significantly. Initially, the full dataset caused long processing times, likely due to its size and complexity, which made it challenging to manage and analyze efficiently. Additionally, I faced warnings regarding the data structure, such as the message indicating that the number of items to replace was not a multiple of the replacement length. This suggested there were mismatches in the expected data format, leading to further complications.
Given these challenges, I decided to simplify my approach and run the VADER sentiment analysis on only a sample of the data. By focusing on a smaller subset, I could more effectively troubleshoot any issues and achieve quicker results, ultimately allowing for a clearer understanding of the sentiment dynamics present in Grimm’s Fairy Tales without the overwhelming burden of processing the entire dataset at once.
# re-import the data so it can be scored line by line instead of word by word
#grimm_text <- readLines("C:/Users/tiffh/Downloads/Assignment#10/grimm.txt")
#grimm_df <- data.frame(text = grimm_text, stringsAsFactors = FALSE)
#print(head(grimm_df))
# subset of the first 100 lines for faster processing
#sample_grimm_df <- grimm_df[1:100, ]
# sentiment analysis on the sample
#vader_results_sample <- vader_df(sample_grimm_df)
#print(head(vader_results_sample))
#colnames(vader_results_sample)
# summary statistics on the sample
#summary_statistics <- vader_results_sample %>%
#summarise(
# mean_positive = mean(pos, na.rm = TRUE),
# mean_negative = mean(neg, na.rm = TRUE),
# mean_neutral = mean(neu, na.rm = TRUE),
#mean_compound = mean(compound, na.rm = TRUE)
# )
#print(summary_statistics)
#plot
#sentiment_counts <- vader_results_sample %>%
# summarise(
# Positive = sum(pos > 0, na.rm = TRUE),
# Negative = sum(neg > 0, na.rm = TRUE),
# Neutral = sum(neu > 0, na.rm = TRUE)
# ) %>%
# pivot_longer(cols = everything(), names_to = "Sentiment", values_to = "Count")
# Plot the counts
#ggplot(sentiment_counts, aes(x = Sentiment, y = Count, fill = Sentiment)) +
# geom_bar(stat = "identity") +
# labs(title = "Counts of Positive, Negative, and Neutral Sentiments",
# x = "Sentiment Type",
# y = "Count") +
#theme_minimal()
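A variant of the sampling step above is sketched here on made-up text. It drops blank lines before scoring, on the assumption (not verified against this dataset) that empty lines may contribute to the warnings described earlier. `grimm_text` here is a hypothetical stand-in for the lines of grimm.txt, and the VADER call only runs if the vader package is installed.

```r
# Hypothetical stand-in for the lines read from grimm.txt
grimm_text <- c("Once upon a time there was a dear little girl.",
                "",
                "   ",
                "The wicked wolf swallowed the poor grandmother.",
                "",
                "They all lived happily ever after.")

# Drop blank or whitespace-only lines before scoring
non_blank <- grimm_text[nzchar(trimws(grimm_text))]

# Take a reproducible random sample of at most 100 lines
set.seed(10)
sample_lines <- sample(non_blank, size = min(100, length(non_blank)))

# Score the sample only when the vader package is available
if (requireNamespace("vader", quietly = TRUE)) {
  vader_results_sample <- vader::vader_df(sample_lines)
  print(head(vader_results_sample))
}
```

A random sample differs from the first 100 lines used in the chunk above: it is less likely to be dominated by front matter and the opening tale, but results will vary with the seed.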
Examining the sentiment scores for the sample reveals some intriguing patterns. The average positive score is around 0.066, indicating a modest level of positivity in the narratives. The mean negative score is slightly lower at 0.042, suggesting that negativity is less prevalent in these tales, which surprised me.
What stands out is the high neutral score of 0.892. This reflects a tendency for the language in the stories to be descriptive and straightforward, rather than heavily emotional. The overall compound score, which averages 0.054, supports this observation, indicating that while there is a hint of positivity, the overall sentiment leans more towards neutrality.
Overall, these findings provide a snapshot of the sentiment present in the sample analyzed, but they may not be representative of every story within Grimm’s Fairy Tales due to the limited nature of the sample.
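The counts in the plotting chunk treat a line as positive whenever pos > 0, so a single line can fall into several categories at once. An alternative, sketched below with invented scores, is to label each line once from its compound score. The cut-offs of 0.05 and -0.05 follow the convention commonly cited for VADER; treating them as the right thresholds for fairy-tale text is an assumption.

```r
# Hypothetical compound scores, one per line of text
compound <- c(0.62, 0.00, -0.30, 0.04, -0.05, 0.05)

# Conventional cut-offs: >= 0.05 positive, <= -0.05 negative, otherwise neutral
label <- ifelse(compound >= 0.05, "positive",
         ifelse(compound <= -0.05, "negative", "neutral"))

# Each line now belongs to exactly one category
table(label)
```

With this labelling the three counts add up to the number of lines scored, which makes the bar chart easier to interpret.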
Conclusion
This is my first time using lexicons, and it has been a highly insightful experience for word analysis. Exploring the various sentiment analysis packages available has opened my eyes to the diverse methods and tools at my disposal for understanding language in depth. The ability to quantify and visualize sentiments associated with specific words has enhanced my appreciation for the subtleties of text and provided valuable insights that can be applied to future analyses. I look forward to further exploring these resources and expanding my skills in text mining.
Reference
Silge, Julia, and David Robinson. “Sentiment Analysis.” Tidy Text Mining. Last modified August 21, 2023. https://www.tidytextmining.com/sentiment#most-positive-negative.
GeeksforGeeks. 2024. “Python Sentiment Analysis Using VADER.” Accessed November 2, 2024. https://www.geeksforgeeks.org/python-sentiment-analysis-using-vader/.
“VADER: Valence Aware Dictionary and sEntiment Reasoner.” 2024. Accessed November 2, 2024. https://cran.r-project.org/web/packages/vader/vader.pdf.