Sentiment analysis
library(wordcloud)
library(reshape2)
library(janeaustenr)
library(tidytext)
library(lexicon)
library(tidyverse)   # dplyr, ggplot2, stringr, tidyr
library(knitr)       # kable
library(kableExtra)  # kable_styling
library(gridExtra)   # grid.arrange
The Jane Austen analysis is reproduced courtesy of O'Reilly: Text Mining with R.
Licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 United States License.
We will use three general-purpose sentiment lexicons.
afinn_words <- get_sentiments("afinn") # words scored from -5 to +5
bing_words  <- get_sentiments("bing")  # words are simply negative or positive
nrc_words   <- get_sentiments("nrc")   # words are categorized: anger, fear, etc.
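To see how the lexicons differ, look up a single word in each; the results noted in the comments are illustrative of what each lexicon typically returns.
# compare one word across the three lexicons
afinn_words %>% filter(word == "abandon") # a numeric value (e.g. -2)
bing_words %>% filter(word == "abandon")  # labeled "negative"
nrc_words %>% filter(word == "abandon")   # may match several categories (fear, negative, sadness)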
nrc_words %>%
group_by(sentiment) %>%
summarize(n())
## # A tibble: 10 x 2
## sentiment `n()`
## <chr> <int>
## 1 anger 1246
## 2 anticipation 837
## 3 disgust 1056
## 4 fear 1474
## 5 joy 687
## 6 negative 3318
## 7 positive 2308
## 8 sadness 1187
## 9 surprise 532
## 10 trust 1230
The unnest_tokens function parses each word into a separate row so we can aggregate by specific words.
ungroup() "ungroups": it removes the grouping added by group_by() so later operations apply to the whole table. By that point the mutate() has already given every one of the 73K lines in the books a line number and a chapter.
unnest_tokens then parses each word in the text column into a word column, turning 73K records into 725K.
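Before the real code, here is a toy illustration (made-up lines) of how cumsum() over str_detect() turns chapter headings into a running chapter counter:
# cumsum() of a logical vector increments at every heading line
demo <- tibble(text = c("CHAPTER 1", "It is a truth...", "CHAPTER 2", "More text."))
demo %>%
  mutate(chapter = cumsum(str_detect(text, regex("^chapter", ignore_case = TRUE))))
## chapter: 1, 1, 2, 2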
tidy_books <- austen_books() %>%
group_by(book) %>%
mutate(
linenumber = row_number(),
chapter = cumsum(str_detect(text,
regex("^chapter [\\divxlc]",
ignore_case = TRUE)))) %>%
ungroup() %>%
unnest_tokens(word, text)
The tidy_books data frame now contains all 6 books, with a separate record for every word.
tidy_books %>%
group_by(book) %>%
summarize(n())
## # A tibble: 6 x 2
## book `n()`
## <fct> <int>
## 1 Sense & Sensibility 119957
## 2 Pride & Prejudice 122204
## 3 Mansfield Park 160460
## 4 Emma 160996
## 5 Northanger Abbey 77780
## 6 Persuasion 83658
Populate the nrc_joy data frame, which contains the words associated with joy.
# get the "joy" words
nrc_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
Then count how often each joy word appears in Emma.
# count the number of joy words in the book "Emma"
tidy_books %>%
filter(book == "Emma") %>%
inner_join(nrc_joy) %>%
count(word, sort = TRUE)
## # A tibble: 301 x 2
## word n
## <chr> <int>
## 1 good 359
## 2 friend 166
## 3 hope 143
## 4 happy 125
## 5 love 117
## 6 deal 92
## 7 found 92
## 8 present 89
## 9 kind 82
## 10 happiness 76
## # ... with 291 more rows
Next, transform tidy_books with the bing sentiments.
Create a new field called index to break the books into sections.
Use pivot_wider to spread the positive and negative counts into separate columns.
tidy_books looks like this:
| Book | line | chapter | word |
|---|---|---|---|
| Sense & Sensibility | 1 | 0 | sense |
| Sense & Sensibility | 1 | 0 | and |
| Sense & Sensibility | 1 | 0 | sensibility |
The count() call adds an index column via integer division (linenumber %/% 80), so each index represents an arbitrary section of 80 lines (see the small example after the tables below):
| Book | index | sentiment | count |
|---|---|---|---|
| Sense & Sensibility | 73 | negative | 29 |
| Sense & Sensibility | 73 | positive | 21 |
pivot_wider turns it into this:
| Book | index | negative | positive |
|---|---|---|---|
| Sense & Sensibility | 73 | 29 | 21 |
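A quick look at what the integer division does, on a few illustrative line numbers:
# linenumber %/% 80 assigns each line to a block of 80 lines
c(0, 79, 80, 159, 5840) %/% 80
## [1]  0  0  1  1 73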
jane_austen_sentiment <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(book, index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative)
Plot the sentiment trajectory of each book.
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, ncol = 2, scales = "free_x")
Display a bar chart using geom_col, which plots values as given; e.g. in the counts below, the miss record appears once with a value of 1855.
A brief review of bar charts:
geom_col() uses stat_identity(): it leaves the data as-is and plots the values you supply; use it for pre-aggregated categorical data.
geom_bar() uses stat_count() and plots counts of categorical data.
Histograms plot counts of continuous variables grouped into bins.
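A minimal contrast on toy data (made-up values):
toy <- tibble(fruit = c("apple", "apple", "pear"), weight = c(2, 3, 5))
ggplot(toy, aes(fruit, weight)) + geom_col() # bar heights: apple 5 (2 + 3 stacked), pear 5
ggplot(toy, aes(fruit)) + geom_bar()         # bar heights: apple 2, pear 1 (row counts)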
bing_word_counts <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
bing_word_counts %>%
group_by(sentiment) %>%
slice_max(n, n = 10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(x = "Contribution to sentiment",
y = NULL)
Bind a custom row (the word "miss") onto the stop_words dataset; an anti_join against a stop word list then removes those words.
Word Clouds (also known as wordle, word collage or tag cloud) are visual representations of words that give greater prominence to words that appear more frequently.
# stop words is a data set in the tidytext package
# A data frame with 1149 rows and 2 variables:
custom_stop_words <- bind_rows(tibble(word = c("miss"),
lexicon = c("custom")),
stop_words)
tidy_books %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
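Note that the chunk above joins against stop_words, so the custom word "miss" survives into the cloud; to drop it as well, join against custom_stop_words instead. A small variant sketch:
# variant: use the custom list so "miss" is removed too
tidy_books %>%
  anti_join(custom_stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))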
The acast function is part of the reshape2 package, a newer take on the original reshape package.
acast and dcast are related functions that cast data into arrays/matrices and data frames, respectively.
comparison.cloud is part of the wordcloud package and compares word frequencies between two data sets.
In this case the sentiment field is either positive or negative.
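What acast produces here is a word-by-sentiment matrix; a toy sketch with made-up counts:
toy <- data.frame(word = c("good", "bad"), sentiment = c("positive", "negative"), n = c(3, 2))
acast(toy, word ~ sentiment, value.var = "n", fill = 0)
##      negative positive
## bad         2        0
## good        0        3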
tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("gray20", "gray80"),
max.words = 100)
Moving on to sentences.
Print a few example sentences to show what unnest_tokens did. Notice that the sentence tokenizer splits on periods, so an abbreviation like "Mrs." incorrectly ends a sentence.
# prideprejudice is a dataset included in the janeaustenr package
p_and_p_sentences <- tibble(text = prideprejudice) %>%
unnest_tokens(sentence, text, token = "sentences")
for (i in 140:144) {
print(p_and_p_sentences$sentence[i])
}
## [1] "\"i do not believe mrs."
## [1] "long will do any such thing."
## [1] "she has two nieces"
## [1] "of her own."
## [1] "she is a selfish, hypocritical woman, and i have no opinion"
Use unnest_tokens with token = "regex" to split each book into chapters.
Note: austen_books() contains just two columns, text and book. Each count below is one higher than the book's actual number of chapters because the front matter before the first chapter heading becomes its own segment.
austen_chapters <- austen_books() %>%
group_by(book) %>%
unnest_tokens(chapter, text, token = "regex",
pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
ungroup()
austen_chapters %>%
group_by(book) %>%
summarise(chapters = n())
## # A tibble: 6 x 2
## book chapters
## <fct> <int>
## 1 Sense & Sensibility 51
## 2 Pride & Prejudice 62
## 3 Mansfield Park 49
## 4 Emma 56
## 5 Northanger Abbey 32
## 6 Persuasion 25
Isolate the counts of negative words and, for each book, display the chapter with the highest ratio of negative words to total words.
bingnegative <- get_sentiments("bing") %>%
filter(sentiment == "negative")
wordcounts <- tidy_books %>%
group_by(book, chapter) %>%
summarize(words = n())
tidy_books %>%
semi_join(bingnegative) %>%
group_by(book, chapter) %>%
summarize(negativewords = n()) %>%
left_join(wordcounts, by = c("book", "chapter")) %>%
mutate(ratio = negativewords/words) %>%
filter(chapter != 0) %>%
slice_max(ratio, n = 1) %>%
ungroup()
## # A tibble: 6 x 5
## book chapter negativewords words ratio
## <fct> <int> <int> <int> <dbl>
## 1 Sense & Sensibility 43 161 3405 0.0473
## 2 Pride & Prejudice 34 111 2104 0.0528
## 3 Mansfield Park 46 173 3685 0.0469
## 4 Emma 15 151 3340 0.0452
## 5 Northanger Abbey 21 149 2982 0.0500
## 6 Persuasion 4 62 1807 0.0343
In this section we will download a book from Project Gutenberg:
In Our Time, one of Ernest Hemingway's earliest books.
It's not a perfect comparison. I chose Hemingway because I have some thoughts about his style.
Let's acknowledge the differences between our authors.
| Author | Nationality | Born | Died | Sample Size |
|---|---|---|---|---|
| Jane Austen | English | 1775 | 1817 | 6 Books |
| Ernest Hemingway | American | 1899 | 1961 | 1 Book |
We will use the lexicon package and run several comparisons between Austen and Hemingway.
Most of its datasets seem more appropriate for modern usage.
kable(lexicon::available_data(), caption = "lexicon datasets", row.names = FALSE, booktabs = TRUE, table.attr = "style='width:80%;'") %>%
  kable_styling(font_size = 8)
| Data | Description |
|---|---|
| cliches | Common Cliches |
| common_names | First Names (U.S.) |
| constraining_loughran_mcdonald | Loughran-McDonald Constraining Words |
| emojis_sentiment | Emoji Sentiment Data |
| freq_first_names | Frequent U.S. First Names |
| freq_last_names | Frequent U.S. Last Names |
| function_words | Function Words |
| grady_augmented | Augmented List of Grady Ward’s English Words and Mark Kantrowitz’s Names List |
| hash_emojis | Emoji Description Lookup Table |
| hash_emojis_identifier | Emoji Identifier Lookup Table |
| hash_emoticons | Emoticons |
| hash_grady_pos | Grady Ward’s Moby Parts of Speech |
| hash_internet_slang | List of Internet Slang and Corresponding Meanings |
| hash_lemmas | Lemmatization List |
| hash_nrc_emotions | NRC Emotion Table |
| hash_sentiment_emojis | Emoji Sentiment Polarity Lookup Table |
| hash_sentiment_huliu | Hu Liu Polarity Lookup Table |
| hash_sentiment_jockers | Jockers Sentiment Polarity Table |
| hash_sentiment_jockers_rinker | Combined Jockers & Rinker Polarity Lookup Table |
| hash_sentiment_loughran_mcdonald | Loughran-McDonald Polarity Table |
| hash_sentiment_nrc | NRC Sentiment Polarity Table |
| hash_sentiment_senticnet | Augmented SenticNet Polarity Table |
| hash_sentiment_sentiword | Augmented Sentiword Polarity Table |
| hash_sentiment_slangsd | SlangSD Sentiment Polarity Table |
| hash_sentiment_socal_google | SO-CAL Google Polarity Table |
| hash_valence_shifters | Valence Shifters |
| key_contractions | Contraction Conversions |
| key_corporate_social_responsibility | Nadra Pencle and Irina Malaescu’s Corporate Social Responsibility Dictionary |
| key_grade | Grades Data Set |
| key_rating | Ratings Data Set |
| key_regressive_imagery | Colin Martindale’s English Regressive Imagery Dictionary |
| key_sentiment_jockers | Jockers Sentiment Data Set |
| modal_loughran_mcdonald | Loughran-McDonald Modal List |
| nrc_emotions | NRC Emotions |
| pos_action_verb | Action Word List |
| pos_df_irregular_nouns | Irregular Nouns Word Dataframe |
| pos_df_pronouns | Pronouns |
| pos_interjections | Interjections |
| pos_preposition | Preposition Words |
| profanity_alvarez | Alejandro U. Alvarez’s List of Profane Words |
| profanity_arr_bad | Stackoverflow user2592414’s List of Profane Words |
| profanity_banned | bannedwordlist.com’s List of Profane Words |
| profanity_racist | Titus Wormer’s List of Racist Words |
| profanity_zac_anger | Zac Anger’s List of Profane Words |
| sw_dolch | Leveled Dolch List of 220 Common Words |
| sw_fry_100 | Fry’s 100 Most Commonly Used English Words |
| sw_fry_1000 | Fry’s 1000 Most Commonly Used English Words |
| sw_fry_200 | Fry’s 200 Most Commonly Used English Words |
| sw_fry_25 | Fry’s 25 Most Commonly Used English Words |
| sw_jockers | Matthew Jocker’s Expanded Topic Modeling Stopword List |
| sw_loughran_mcdonald_long | Loughran-McDonald Long Stopword List |
| sw_loughran_mcdonald_short | Loughran-McDonald Short Stopword List |
| sw_lucene | Lucene Stopword List |
| sw_mallet | MALLET Stopword List |
| sw_python | Python Stopword List |
The action verb dataset is a good one for our purposes.
verbs_df <- data.frame(word = pos_action_verb)
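pos_action_verb should be a plain character vector of verbs; a quick peek (output not shown in the original):
str(pos_action_verb)  # chr vector of action verbs
head(pos_action_verb)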
Download the data.
destfile <- "hemingway.txt" # not used below; read.fwf reads straight from the URL
url <- "https://www.gutenberg.org/files/61085/61085-0.txt"
raw_text <- read.fwf(url, widths = 1000) # read each line as a single fixed-width field
Tidy the data.
The Hemingway file has a lot of non-ASCII characters, blank lines, and extra verbiage.
We will use gsub to remove anything outside the printable ASCII range,
which starts at hexadecimal 20 (space), ends at hexadecimal 7E (tilde),
and includes all alphanumeric characters and punctuation marks.
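A quick check of the pattern on a made-up string:
# curly quotes and em dashes are outside \x20-\x7E and get stripped
gsub("[^\x20-\x7E]", "", "a \u201cquoted\u2014phrase\u201d")
## [1] "a quotedphrase"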
raw_text2 <- na.omit(raw_text)
colnames(raw_text2)<-"V1"
raw_text3<-data.frame(gsub("[^\x20-\x7E]", "", raw_text2$V1))
hemmingway_df <- data.frame(
text =character()
)
keep=0
for(i in 1:nrow(raw_text3)) {
if (str_detect(raw_text3[i,], "Here ends _The Inquest_")) { # end here
keep=0
}
if (keep==1) {
hemmingway_df<-rbind(hemmingway_df,as.data.frame(raw_text3[i,]))
}
if (str_detect(raw_text3[i,], "chapter 1")) { # start here
keep=1
}
}
colnames(hemmingway_df) <- "text"
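The row-by-row rbind loop above works but grows the data frame one row at a time. A vectorized sketch of the same extraction, assuming each marker string matches a line in the text:
# find the start and end markers, then keep only the rows between them
start <- which(str_detect(raw_text3[[1]], "chapter 1"))[1]
end <- which(str_detect(raw_text3[[1]], "Here ends _The Inquest_"))[1]
hemmingway_df <- data.frame(text = raw_text3[(start + 1):(end - 1), 1])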
Separate the words into rows.
hemmingway_words_df <- hemmingway_df %>%
  unnest_tokens(word, text)
Isolate the 20 most common verbs from both authors.
verb_count_eh<-hemmingway_words_df %>%
inner_join(verbs_df) %>%
count(word, sort = TRUE) %>%
ungroup() %>%
head(n=20)
verb_count_ja<-tidy_books %>%
inner_join(verbs_df) %>%
count(word, sort = TRUE) %>%
ungroup() %>%
head(n=20)
Display the verb counts side by side.
plot1<-verb_count_ja %>% ggplot(aes(y=word, x=n)) +
geom_col(color = "#112446", fill = "#ffffff") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90)) +
labs(title='Jane Austen')
plot2<-verb_count_eh %>% ggplot(aes(y=word, x=n)) +
geom_col(color = "#112446", fill = "#ffffff") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90)) +
labs(title='Ernest Hemingway')
grid.arrange(plot1, plot2, ncol = 2)
I would expect the themes of romance and war to be juxtaposed here. It's interesting to see think, hope, and wish on the left and words like kill, face, and forward on the right.
Extra question.
I've always admired Hemingway for his simple prose. I'd like to compare the length of his words to those of Jane Austen.
mean(nchar(hemmingway_words_df$word))
## [1] 4.098023
mean(nchar(tidy_books$word))
## [1] 4.344304
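The gap is about a quarter of a character per word. One way to check that it is more than noise would be a two-sample t-test; a sketch, not run in the original analysis:
# Welch two-sample t-test on word lengths (illustrative)
t.test(nchar(hemmingway_words_df$word), nchar(tidy_books$word))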
Notes: This is an interesting baseline for exploring how the language of literature evolves over time.
Having only one short book by Hemingway is not sufficient, but it was the only Hemingway book I could find.
miss and man show up as verbs; I'm pretty sure they are mostly not used as verbs in these texts.