library(tidytext)
## Warning: package 'tidytext' was built under R version 4.2.3
library(janeaustenr)
## Warning: package 'janeaustenr' was built under R version 4.2.3
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.1.0
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.4 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(textdata)
## Warning: package 'textdata' was built under R version 4.2.3
Please note that the base code for this project was taken from “Text Mining with R: A Tidy Approach” by Julia Silge and David Robinson (Sebastopol, CA: O’Reilly Media, 2017; ISBN 978-1-491-98165-8). That code appears below in the named chunks “silge-robinson” and “silge-robinson2”.
This first R chunk shows how to “mine” words from Jane Austen’s novels, with each observation in the resulting dataframe being a single word tagged with its book, chapter, and line number.
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
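As a quick illustration of the chapter-detection pattern above (a minimal sketch, not part of the original analysis), the regex flags lines that begin with “chapter” followed by a digit or Roman-numeral character, and cumsum() turns those flags into a running chapter counter:
# Illustrative only: the first two strings match, the third does not.
str_detect(c("CHAPTER 1", "Chapter XI", "It is a truth universally acknowledged"),
           regex("^chapter [\\divxlc]", ignore_case = TRUE))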
This second R chunk shows how to load three sentiment lexicons, which I’d also like to cite (linked below):
- afinn
- bing
- nrc
afinn <- get_sentiments('afinn')
bing <- get_sentiments('bing')
nrc <- get_sentiments('nrc')
I will use this base as inspiration for my own analysis.
I will look at the full available Jane Austen corpus to extend this code, evaluating which of the author’s books have the highest percentages of neutral, positive, and negative words.
Showing word count by book. Emma is the longest book; Northanger Abbey is the shortest.
austen_word_counts <- as.data.frame(table(tidy_books$book))
colnames(austen_word_counts) <- c("book","total_word_count")
austen_word_counts
## book total_word_count
## 1 Sense & Sensibility 119957
## 2 Pride & Prejudice 122204
## 3 Mansfield Park 160460
## 4 Emma 160996
## 5 Northanger Abbey 77780
## 6 Persuasion 83658
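For reference, the same counts can be produced with a tidyverse one-liner (a sketch using dplyr::count; the table() approach above is what generated the results shown):
# Equivalent count of word tokens per book, returned as a tibble.
austen_word_counts_alt <- tidy_books %>%
  count(book, name = "total_word_count")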
Taking the positive and negative word lexicon from NRC, which I’ll use to evaluate Jane Austen’s corpus; merging it with the tidy_books df created previously; then showing positive and negative word counts by book.
nrc_pos_neg <- nrc %>%
  filter(sentiment == 'negative' | sentiment == 'positive')

austen_sentiment <- tidy_books %>%
  inner_join(nrc_pos_neg) %>%
  group_by(book, sentiment) %>%
  summarise(total_count = n(), .groups = 'drop') %>%
  as.data.frame() %>%
  pivot_wider(names_from = sentiment,
              values_from = total_count)
## Joining with `by = join_by(word)`
## Warning in inner_join(., nrc_pos_neg): Each row in `x` is expected to match at most 1 row in `y`.
## ℹ Row 251 of `x` matches multiple rows.
## ℹ If multiple matches are expected, set `multiple = "all"` to silence this
## warning.
colnames(austen_sentiment) <- c("book","negative_words","positive_words")
austen_sentiment
## # A tibble: 6 × 3
## book negative_words positive_words
## <fct> <int> <int>
## 1 Sense & Sensibility 3994 7314
## 2 Pride & Prejudice 3630 7273
## 3 Mansfield Park 4694 9269
## 4 Emma 4438 9262
## 5 Northanger Abbey 2456 4520
## 6 Persuasion 2226 5248
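The join warning above appears because at least one word in the text matches more than one row of the filtered NRC lexicon. As the message suggests, this can be acknowledged explicitly; a sketch, assuming the multiple argument available in the dplyr 1.1.0 loaded above:
# Same join as above, with multiple matches explicitly allowed to silence the warning.
austen_sentiment_quiet <- tidy_books %>%
  inner_join(nrc_pos_neg, by = "word", multiple = "all") %>%
  group_by(book, sentiment) %>%
  summarise(total_count = n(), .groups = 'drop') %>%
  pivot_wider(names_from = sentiment, values_from = total_count)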
Merging the total word count df with the positive and negative word count df to assess the percentage of all words that were positive or negative in each book.
austen_combined <- austen_word_counts %>%
  inner_join(austen_sentiment) %>%
  transform(perc_neg = scales::percent(negative_words / total_word_count),
            perc_pos = scales::percent(positive_words / total_word_count),
            neutral_words = total_word_count - negative_words - positive_words)
## Joining with `by = join_by(book)`
austen_combined <- austen_combined %>%
  transform(perc_neutr = scales::percent(neutral_words / total_word_count))

austen_final <- austen_combined[, c('book', 'total_word_count', 'positive_words',
                                    'negative_words', 'neutral_words',
                                    'perc_pos', 'perc_neg', 'perc_neutr')]
austen_final
## book total_word_count positive_words negative_words
## 1 Sense & Sensibility 119957 7314 3994
## 2 Pride & Prejudice 122204 7273 3630
## 3 Mansfield Park 160460 9269 4694
## 4 Emma 160996 9262 4438
## 5 Northanger Abbey 77780 4520 2456
## 6 Persuasion 83658 5248 2226
## neutral_words perc_pos perc_neg perc_neutr
## 1 108649 6.097% 3.330% 90.573%
## 2 111301 5.952% 2.970% 91.078%
## 3 146497 5.777% 2.925% 91.298%
## 4 147296 5.753% 2.757% 91.490%
## 5 70804 5.811% 3.158% 91.031%
## 6 76184 6.273% 2.661% 91.066%
The book with the highest percentage of positive words was Persuasion, which also had the lowest percentage of negative words.
austen_final %>% arrange(desc(perc_pos))
## book total_word_count positive_words negative_words
## 1 Persuasion 83658 5248 2226
## 2 Sense & Sensibility 119957 7314 3994
## 3 Pride & Prejudice 122204 7273 3630
## 4 Northanger Abbey 77780 4520 2456
## 5 Mansfield Park 160460 9269 4694
## 6 Emma 160996 9262 4438
## neutral_words perc_pos perc_neg perc_neutr
## 1 76184 6.273% 2.661% 91.066%
## 2 108649 6.097% 3.330% 90.573%
## 3 111301 5.952% 2.970% 91.078%
## 4 70804 5.811% 3.158% 91.031%
## 5 146497 5.777% 2.925% 91.298%
## 6 147296 5.753% 2.757% 91.490%
The book with the highest percentage of negative words was “Sense & Sensibility,” which is interesting because it also had the second-highest percentage of positive words. It therefore had fewer “neutral” words, making it a more emotionally charged book than the others.
austen_final %>% arrange(desc(perc_neg))
## book total_word_count positive_words negative_words
## 1 Sense & Sensibility 119957 7314 3994
## 2 Northanger Abbey 77780 4520 2456
## 3 Pride & Prejudice 122204 7273 3630
## 4 Mansfield Park 160460 9269 4694
## 5 Emma 160996 9262 4438
## 6 Persuasion 83658 5248 2226
## neutral_words perc_pos perc_neg perc_neutr
## 1 108649 6.097% 3.330% 90.573%
## 2 70804 5.811% 3.158% 91.031%
## 3 111301 5.952% 2.970% 91.078%
## 4 146497 5.777% 2.925% 91.298%
## 5 147296 5.753% 2.757% 91.490%
## 6 76184 6.273% 2.661% 91.066%
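One caveat on the two sorts above: perc_pos and perc_neg are character strings produced by scales::percent(), so arrange() sorts them alphabetically. That happens to give the right order here because every value has the same number of digits, but sorting on the underlying numeric share is more robust; a minimal sketch (perc_pos_num is a column introduced only for this illustration):
# Rank books by the numeric positive share rather than the formatted string.
austen_final %>%
  mutate(perc_pos_num = positive_words / total_word_count) %>%
  arrange(desc(perc_pos_num))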
Per the assignment description, I will now perform an additional sentiment analysis using a different corpus and a different lexicon than those used above.
For the lexicon I will use the SentimentAnalysis R package (package description linked here).
For the corpus I will use the “friends” package in R, where each observation is a line of dialogue spoken by a character in the TV show Friends.
library(SentimentAnalysis)
## Warning: package 'SentimentAnalysis' was built under R version 4.2.3
##
## Attaching package: 'SentimentAnalysis'
## The following object is masked from 'package:base':
##
## write
library(friends)
## Warning: package 'friends' was built under R version 4.2.3
friends_lines <- friends
As the value count generated below shows, there are 699 distinct speakers with lines over the course of Friends (a list that also includes non-character entries such as Scene Directions). To ensure an adequate sample size, I’ll only look at the six main characters, who account for the vast majority of lines: Monica Geller, Rachel Green, Ross Geller, Chandler Bing, Joey Tribbiani, and Phoebe Buffay.
Per this dataframe, the characters with the most lines are Rachel Green (9312), Ross Geller (9157), and Chandler Bing (8465).
length(table(friends_lines$speaker))
## [1] 699
talkers <- as.data.frame(table(friends_lines$speaker))
top_10_talkers <- talkers %>%
  slice_max(order_by = Freq, n = 10)
names(top_10_talkers) <- c('character','speaking_lines')
top_10_talkers
## character speaking_lines
## 1 Rachel Green 9312
## 2 Ross Geller 9157
## 3 Chandler Bing 8465
## 4 Monica Geller 8441
## 5 Joey Tribbiani 8171
## 6 Phoebe Buffay 7501
## 7 Scene Directions 6063
## 8 #ALL# 347
## 9 Mike Hannigan 330
## 10 Richard Burke 281
Filtering to only the top six characters in terms of lines spoken.
friends_lines <- friends_lines %>%
  filter(speaker %in% c("Monica Geller", "Rachel Green", "Ross Geller",
                        "Chandler Bing", "Joey Tribbiani", "Phoebe Buffay"))
Splitting each row into multiple rows so that each word is its own row (to prepare for sentiment analysis). The delimiter between words is a space (" ").
friends_words <- friends_lines %>%
  separate_rows(text, sep = " ")
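Note that splitting on a single space keeps punctuation and capitalization attached to words, which can reduce matches against a lowercase dictionary. A more thorough alternative would be tidytext::unnest_tokens, which lowercases and strips punctuation; a sketch only, not the tokenization used for the results below:
# Alternative tokenization: one lowercase, punctuation-free word per row.
friends_words_alt <- friends_lines %>%
  unnest_tokens(word, text)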
Showing which Friends characters spoke the most, in terms of lines, words, and words per line. Monica has the fewest words per line (9.83) while Phoebe has the most (10.87). Rachel and Ross have the most lines overall (9312 and 9157), which drives their top total word counts (97,633 and 95,561) among all characters.
friends_word_count <- friends_words %>%
  group_by(speaker) %>%
  summarise(total_word_count = n(),
            .groups = 'drop')

top_talkers_lines <- top_10_talkers %>%
  filter(character %in% c("Monica Geller", "Rachel Green", "Ross Geller",
                          "Chandler Bing", "Joey Tribbiani", "Phoebe Buffay"))

friends_summary <- left_join(friends_word_count, top_talkers_lines,
                             by = c('speaker' = 'character'))

friends_summary <- friends_summary %>%
  transform(words_per_line = round((total_word_count / speaking_lines), 2))
friends_summary
## speaker total_word_count speaking_lines words_per_line
## 1 Chandler Bing 86547 8465 10.22
## 2 Joey Tribbiani 86426 8171 10.58
## 3 Monica Geller 82988 8441 9.83
## 4 Phoebe Buffay 81506 7501 10.87
## 5 Rachel Green 97633 9312 10.48
## 6 Ross Geller 95561 9157 10.44
I first attempted to perform sentiment analysis using the analyzeSentiment function in the SentimentAnalysis package, e.g. friends_words$sentiment <- analyzeSentiment(friends_words$text). However, that function is too slow to work well on this many individual words; it’s designed for smaller samples (e.g. a few paragraphs).
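For completeness, a small-scale use of analyzeSentiment might look like the sketch below, scoring a sample of whole lines rather than individual words. This is illustrative only and is not run for the results that follow; the SentimentGI column name is taken from the package documentation.
# Score a small sample of full lines with the built-in General Inquirer dictionary.
sample_scores <- analyzeSentiment(head(friends_lines$text, 100))
summary(sample_scores$SentimentGI)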
Therefore I’ll convert the SentimentAnalysis package’s word dictionary (“DictionaryGI”) to a dataframe and perform the analysis by joining the word dataframe with the sentiment dataframe, as we did with the Jane Austen data.
Preparing sentiment df.
data(DictionaryGI)
str(DictionaryGI)
## List of 2
## $ negative: chr [1:2005] "abandon" "abandonment" "abate" "abdicate" ...
## $ positive: chr [1:1637] "abide" "ability" "able" "abound" ...
# Pad the shorter positive vector with NAs so both list elements are the same
# length and can be combined into a dataframe.
length(DictionaryGI$positive) <- length(DictionaryGI$negative)
sa_df <- as.data.frame(DictionaryGI)

neg_words <- sa_df$negative
neg_words <- as.data.frame(neg_words)
neg_words <- neg_words %>%
  mutate(sentiment = "negative")
names(neg_words) <- c('text', 'sentiment')

pos_words <- sa_df$positive
pos_words <- as.data.frame(pos_words)
pos_words <- pos_words %>%
  mutate(sentiment = "positive")
names(pos_words) <- c('text', 'sentiment')

# Stack the two labeled word lists into a single lookup table.
sa_dict <- bind_rows(pos_words, neg_words)
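An equivalent, more compact way to build the same lookup table (a sketch; if run after the padding step above, the filter drops the NA entries that padding introduces):
# Build the sentiment lookup table directly from the two character vectors.
sa_dict_alt <- bind_rows(
  tibble(text = DictionaryGI$positive, sentiment = "positive"),
  tibble(text = DictionaryGI$negative, sentiment = "negative")
) %>%
  filter(!is.na(text))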
friends_sentiment <- friends_words %>%
  inner_join(sa_dict) %>%
  group_by(speaker, sentiment) %>%
  summarise(total_count = n(), .groups = 'drop') %>%
  as.data.frame() %>%
  pivot_wider(names_from = sentiment,
              values_from = total_count)
## Joining with `by = join_by(text)`
## Warning in inner_join(., sa_dict): Each row in `x` is expected to match at most 1 row in `y`.
## ℹ Row 69 of `x` matches multiple rows.
## ℹ If multiple matches are expected, set `multiple = "all"` to silence this
## warning.
colnames(friends_sentiment) <- c("speaker","negative_words","positive_words")
Showing counts of positive and negative words used by each Friends character.
friends_sentiment
## # A tibble: 6 × 3
## speaker negative_words positive_words
## <chr> <int> <int>
## 1 Chandler Bing 2299 3512
## 2 Joey Tribbiani 2288 3564
## 3 Monica Geller 2303 3335
## 4 Phoebe Buffay 2030 3564
## 5 Rachel Green 2477 4321
## 6 Ross Geller 2329 3822
Merging the total Friends word count df with the positive and negative word count df to assess the percentage of all words that were positive or negative for each character.
friends_combined <- friends_summary %>%
  inner_join(friends_sentiment) %>%
  transform(perc_neg = scales::percent(negative_words / total_word_count),
            perc_pos = scales::percent(positive_words / total_word_count),
            neutral_words = total_word_count - negative_words - positive_words)
## Joining with `by = join_by(speaker)`
friends_combined <- friends_combined %>%
  transform(perc_neutr = scales::percent(neutral_words / total_word_count))

friends_final <- friends_combined[, c('speaker', 'total_word_count', 'speaking_lines',
                                      'words_per_line', 'positive_words', 'negative_words',
                                      'neutral_words', 'perc_pos', 'perc_neg', 'perc_neutr')]
In general, this dataframe shows the characters within a relatively narrow band of sentiment: from 4.00% positive (Ross) to 4.43% positive (Rachel), and from 2.44% negative (Ross) to 2.78% negative (Monica). Neutrality ranged from 93.04% (Rachel) to 93.56% (Ross).
These data show Rachel as the most sentimental and most positive character, Ross as the least sentimental, and Monica as the most negative. All of this aligns with my domain knowledge, so the sentiment analysis appears successful.
friends_final
## speaker total_word_count speaking_lines words_per_line positive_words
## 1 Chandler Bing 86547 8465 10.22 3512
## 2 Joey Tribbiani 86426 8171 10.58 3564
## 3 Monica Geller 82988 8441 9.83 3335
## 4 Phoebe Buffay 81506 7501 10.87 3564
## 5 Rachel Green 97633 9312 10.48 4321
## 6 Ross Geller 95561 9157 10.44 3822
## negative_words neutral_words perc_pos perc_neg perc_neutr
## 1 2299 80736 4.058% 2.6564% 93.286%
## 2 2288 80574 4.124% 2.6474% 93.229%
## 3 2303 77350 4.019% 2.7751% 93.206%
## 4 2030 75912 4.373% 2.4906% 93.137%
## 5 2477 90835 4.426% 2.5371% 93.037%
## 6 2329 89410 4.000% 2.4372% 93.563%