Sentiment analysis
library(wordcloud)
library(reshape2)
library(janeaustenr)
library(tidytext)
library(lexicon)
library(tidyverse)   # dplyr, ggplot2, stringr, tidyr
library(knitr)       # kable
library(kableExtra)  # kable_styling
library(gridExtra)   # grid.arrange
The Jane Austen analysis is reproduced courtesy of O'Reilly: Text Mining with R.
Licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 United States License.
We will use three general-purpose sentiment lexicons.
afinn_words <- get_sentiments("afinn") # words scored from -5 to +5
bing_words  <- get_sentiments("bing")  # words are simply negative or positive
nrc_words   <- get_sentiments("nrc")   # words are categorized: anger, fear, etc.
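To see how the lexicons differ, look up a single word in each; the results noted in the comments are illustrative of what each lexicon typically returns.
# compare one word across the three lexicons
afinn_words %>% filter(word == "abandon") # a numeric value (e.g. -2)
bing_words %>% filter(word == "abandon")  # labeled "negative"
nrc_words %>% filter(word == "abandon")   # may match several categories (fear, negative, sadness)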
nrc_words %>%
group_by(sentiment) %>%
summarize(n())
## # A tibble: 10 x 2
## sentiment `n()`
## <chr> <int>
## 1 anger 1246
## 2 anticipation 837
## 3 disgust 1056
## 4 fear 1474
## 5 joy 687
## 6 negative 3318
## 7 positive 2308
## 8 sadness 1187
## 9 surprise 532
## 10 trust 1230
The unnest_tokens function parses each word into a separate row so we can aggregate by specific words.
ungroup() "ungroups": it removes the grouping added by group_by() so later operations apply to the whole table. By that point the mutate() has already given every one of the 73K lines in the books a line number and a chapter.
unnest_tokens then parses each word in the text column into a word column, turning 73K records into 725K.
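Before the real code, here is a toy illustration (made-up lines) of how cumsum() over str_detect() turns chapter headings into a running chapter counter:
# cumsum() of a logical vector increments at every heading line
demo <- tibble(text = c("CHAPTER 1", "It is a truth...", "CHAPTER 2", "More text."))
demo %>%
  mutate(chapter = cumsum(str_detect(text, regex("^chapter", ignore_case = TRUE))))
## chapter: 1, 1, 2, 2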
tidy_books <- austen_books() %>%
group_by(book) %>%
mutate(
linenumber = row_number(),
chapter = cumsum(str_detect(text,
regex("^chapter [\\divxlc]",
ignore_case = TRUE)))) %>%
ungroup() %>%
unnest_tokens(word, text)
The tidy_books data frame now contains all 6 books, with a separate record for every word.
tidy_books %>%
group_by(book) %>%
summarize(n())
## # A tibble: 6 x 2
## book `n()`
## <fct> <int>
## 1 Sense & Sensibility 119957
## 2 Pride & Prejudice 122204
## 3 Mansfield Park 160460
## 4 Emma 160996
## 5 Northanger Abbey 77780
## 6 Persuasion 83658
Populate the nrc_joy data frame, which contains the words associated with joy.
# get the "joy" words
nrc_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
Then count how often each joy word appears in Emma.
# count the number of joy words in the book "Emma"
tidy_books %>%
filter(book == "Emma") %>%
inner_join(nrc_joy) %>%
count(word, sort = TRUE)
## # A tibble: 301 x 2
## word n
## <chr> <int>
## 1 good 359
## 2 friend 166
## 3 hope 143
## 4 happy 125
## 5 love 117
## 6 deal 92
## 7 found 92
## 8 present 89
## 9 kind 82
## 10 happiness 76
## # ... with 291 more rows
Next, transform tidy_books with the bing sentiments.
Create a new field called index to break the books into sections.
Use pivot_wider to spread the positive and negative counts into separate columns.
tidy_books looks like this:
| Book | line | chapter | word |
|---|---|---|---|
| Sense & Sensibility | 1 | 0 | sense |
| Sense & Sensibility | 1 | 0 | and |
| Sense & Sensibility | 1 | 0 | sensibility |
The count() call adds an index column via integer division (linenumber %/% 80), so each index represents an arbitrary section of 80 lines (see the small example after the tables below):
| Book | index | sentiment | count |
|---|---|---|---|
| Sense & Sensibility | 73 | negative | 29 |
| Sense & Sensibility | 73 | positive | 21 |
pivot_wider turns it into this:
| Book | index | negative | positive |
|---|---|---|---|
| Sense & Sensibility | 73 | 29 | 21 |
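A quick look at what the integer division does, on a few illustrative line numbers:
# linenumber %/% 80 assigns each line to a block of 80 lines
c(0, 79, 80, 159, 5840) %/% 80
## [1]  0  0  1  1 73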
jane_austen_sentiment <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(book, index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative)
Plot the sentiment trajectory of each book.
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, ncol = 2, scales = "free_x")
Display a bar chart using geom_col, which plots values as given; e.g. in the counts below, the miss record appears once with a value of 1855.
A brief review of bar charts:
geom_col() uses stat_identity(): it leaves the data as-is and plots the values you supply; use it for pre-aggregated categorical data.
geom_bar() uses stat_count() and plots counts of categorical data.
Histograms plot counts of continuous variables grouped into bins.
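A minimal contrast on toy data (made-up values):
toy <- tibble(fruit = c("apple", "apple", "pear"), weight = c(2, 3, 5))
ggplot(toy, aes(fruit, weight)) + geom_col() # bar heights: apple 5 (2 + 3 stacked), pear 5
ggplot(toy, aes(fruit)) + geom_bar()         # bar heights: apple 2, pear 1 (row counts)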
bing_word_counts <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
bing_word_counts %>%
group_by(sentiment) %>%
slice_max(n, n = 10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(x = "Contribution to sentiment",
y = NULL)
Bind a custom row (the word "miss") onto the stop_words dataset; an anti_join against a stop word list then removes those words.
Word Clouds (also known as wordle, word collage or tag cloud) are visual representations of words that give greater prominence to words that appear more frequently.
# stop words is a data set in the tidytext package
# A data frame with 1149 rows and 2 variables:
custom_stop_words <- bind_rows(tibble(word = c("miss"),
lexicon = c("custom")),
stop_words)
tidy_books %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
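Note that the chunk above joins against stop_words, so the custom word "miss" survives into the cloud; to drop it as well, join against custom_stop_words instead. A small variant sketch:
# variant: use the custom list so "miss" is removed too
tidy_books %>%
  anti_join(custom_stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))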
The acast function is part of the reshape2 package, a newer take on the original reshape package.
acast and dcast are related functions that cast data into arrays/matrices and data frames, respectively.
comparison.cloud is part of the wordcloud package and compares word frequencies between two data sets.
In this case the sentiment field is either positive or negative.
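What acast produces here is a word-by-sentiment matrix; a toy sketch with made-up counts:
toy <- data.frame(word = c("good", "bad"), sentiment = c("positive", "negative"), n = c(3, 2))
acast(toy, word ~ sentiment, value.var = "n", fill = 0)
##      negative positive
## bad         2        0
## good        0        3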
tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("gray20", "gray80"),
max.words = 100)
Moving on to sentences.
Print a few example sentences to show what unnest_tokens did. Notice that the sentence tokenizer splits on periods, so an abbreviation like "Mrs." incorrectly ends a sentence.
# prideprejudice is a dataset included in the janeaustenr package
p_and_p_sentences <- tibble(text = prideprejudice) %>%
unnest_tokens(sentence, text, token = "sentences")
for (i in 140:144) {
print(p_and_p_sentences$sentence[i])
}
## [1] "\"i do not believe mrs."
## [1] "long will do any such thing."
## [1] "she has two nieces"
## [1] "of her own."
## [1] "she is a selfish, hypocritical woman, and i have no opinion"
Use unnest_tokens with token = "regex" to split each book into chapters.
Note: austen_books() contains just two columns, text and book. Each count below is one higher than the book's actual number of chapters because the front matter before the first chapter heading becomes its own segment.
austen_chapters <- austen_books() %>%
group_by(book) %>%
unnest_tokens(chapter, text, token = "regex",
pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
ungroup()
austen_chapters %>%
group_by(book) %>%
summarise(chapters = n())
## # A tibble: 6 x 2
## book chapters
## <fct> <int>
## 1 Sense & Sensibility 51
## 2 Pride & Prejudice 62
## 3 Mansfield Park 49
## 4 Emma 56
## 5 Northanger Abbey 32
## 6 Persuasion 25
Isolate the counts of negative words and, for each book, display the chapter with the highest ratio of negative words to total words.
bingnegative <- get_sentiments("bing") %>%
filter(sentiment == "negative")
wordcounts <- tidy_books %>%
group_by(book, chapter) %>%
summarize(words = n())
tidy_books %>%
semi_join(bingnegative) %>%
group_by(book, chapter) %>%
summarize(negativewords = n()) %>%
left_join(wordcounts, by = c("book", "chapter")) %>%
mutate(ratio = negativewords/words) %>%
filter(chapter != 0) %>%
slice_max(ratio, n = 1) %>%
ungroup()
## # A tibble: 6 x 5
## book chapter negativewords words ratio
## <fct> <int> <int> <int> <dbl>
## 1 Sense & Sensibility 43 161 3405 0.0473
## 2 Pride & Prejudice 34 111 2104 0.0528
## 3 Mansfield Park 46 173 3685 0.0469
## 4 Emma 15 151 3340 0.0452
## 5 Northanger Abbey 21 149 2982 0.0500
## 6 Persuasion 4 62 1807 0.0343
In this section we will download a book from Project Gutenberg:
In Our Time, one of Ernest Hemingway's earliest books.
It's not a perfect comparison. I chose Hemingway because I have some thoughts about his style.
Let's acknowledge the differences between our authors.
| Author | Nationality | Born | Died | Sample Size |
|---|---|---|---|---|
| Jane Austen | English | 1775 | 1817 | 6 Books |
| Ernest Hemingway | American | 1899 | 1961 | 1 Book |
We will use the lexicon package and run several comparisons between Austen and Hemingway.
Most of its datasets seem more appropriate for modern usage.
kable(lexicon::available_data(), caption = "lexicon datasets", row.names = FALSE, booktabs = TRUE, table.attr = "style='width:80%;'") %>%
  kable_styling(font_size = 8)
| Data | Description |
|---|---|
| cliches | Common Cliches |
| common_names | First Names (U.S.) |
| constraining_loughran_mcdonald | Loughran-McDonald Constraining Words |
| emojis_sentiment | Emoji Sentiment Data |
| freq_first_names | Frequent U.S. First Names |
| freq_last_names | Frequent U.S. Last Names |
| function_words | Function Words |
| grady_augmented | Augmented List of Grady Ward’s English Words and Mark Kantrowitz’s Names List |
| hash_emojis | Emoji Description Lookup Table |
| hash_emojis_identifier | Emoji Identifier Lookup Table |
| hash_emoticons | Emoticons |
| hash_grady_pos | Grady Ward’s Moby Parts of Speech |
| hash_internet_slang | List of Internet Slang and Corresponding Meanings |
| hash_lemmas | Lemmatization List |
| hash_nrc_emotions | NRC Emotion Table |
| hash_sentiment_emojis | Emoji Sentiment Polarity Lookup Table |
| hash_sentiment_huliu | Hu Liu Polarity Lookup Table |
| hash_sentiment_jockers | Jockers Sentiment Polarity Table |
| hash_sentiment_jockers_rinker | Combined Jockers & Rinker Polarity Lookup Table |
| hash_sentiment_loughran_mcdonald | Loughran-McDonald Polarity Table |
| hash_sentiment_nrc | NRC Sentiment Polarity Table |
| hash_sentiment_senticnet | Augmented SenticNet Polarity Table |
| hash_sentiment_sentiword | Augmented Sentiword Polarity Table |
| hash_sentiment_slangsd | SlangSD Sentiment Polarity Table |
| hash_sentiment_socal_google | SO-CAL Google Polarity Table |
| hash_valence_shifters | Valence Shifters |
| key_contractions | Contraction Conversions |
| key_corporate_social_responsibility | Nadra Pencle and Irina Malaescu’s Corporate Social Responsibility Dictionary |
| key_grade | Grades Data Set |
| key_rating | Ratings Data Set |
| key_regressive_imagery | Colin Martindale’s English Regressive Imagery Dictionary |
| key_sentiment_jockers | Jockers Sentiment Data Set |
| modal_loughran_mcdonald | Loughran-McDonald Modal List |
| nrc_emotions | NRC Emotions |
| pos_action_verb | Action Word List |
| pos_df_irregular_nouns | Irregular Nouns Word Dataframe |
| pos_df_pronouns | Pronouns |
| pos_interjections | Interjections |
| pos_preposition | Preposition Words |
| profanity_alvarez | Alejandro U. Alvarez’s List of Profane Words |
| profanity_arr_bad | Stackoverflow user2592414’s List of Profane Words |
| profanity_banned | bannedwordlist.com’s List of Profane Words |
| profanity_racist | Titus Wormer’s List of Racist Words |
| profanity_zac_anger | Zac Anger’s List of Profane Words |
| sw_dolch | Leveled Dolch List of 220 Common Words |
| sw_fry_100 | Fry’s 100 Most Commonly Used English Words |
| sw_fry_1000 | Fry’s 1000 Most Commonly Used English Words |
| sw_fry_200 | Fry’s 200 Most Commonly Used English Words |
| sw_fry_25 | Fry’s 25 Most Commonly Used English Words |
| sw_jockers | Matthew Jocker’s Expanded Topic Modeling Stopword List |
| sw_loughran_mcdonald_long | Loughran-McDonald Long Stopword List |
| sw_loughran_mcdonald_short | Loughran-McDonald Short Stopword List |
| sw_lucene | Lucene Stopword List |
| sw_mallet | MALLET Stopword List |
| sw_python | Python Stopword List |
The action verb dataset is a good one for our purposes.
verbs_df <- data.frame(word = pos_action_verb)
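pos_action_verb should be a plain character vector of verbs; a quick peek (output not shown in the original):
str(pos_action_verb)  # chr vector of action verbs
head(pos_action_verb)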
Download the data.
destfile <- "hemingway.txt" # not used below; read.fwf reads straight from the URL
url <- "https://www.gutenberg.org/files/61085/61085-0.txt"
raw_text <- read.fwf(url, widths = 1000) # read each line as a single fixed-width field
Tidy the data.
The Hemingway file has a lot of non-ASCII characters, blank lines, and extra verbiage.
We will use gsub to remove anything outside the printable ASCII range,
which starts at hexadecimal 20 (space), ends at hexadecimal 7E (tilde),
and includes all alphanumeric characters and punctuation marks.
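A quick check of the pattern on a made-up string:
# curly quotes and em dashes are outside \x20-\x7E and get stripped
gsub("[^\x20-\x7E]", "", "a \u201cquoted\u2014phrase\u201d")
## [1] "a quotedphrase"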
raw_text2 <- na.omit(raw_text)
colnames(raw_text2)<-"V1"
raw_text3<-data.frame(gsub("[^\x20-\x7E]", "", raw_text2$V1))
hemmingway_df <- data.frame(
text =character()
)
keep=0
for(i in 1:nrow(raw_text3)) {
if (str_detect(raw_text3[i,], "Here ends _The Inquest_")) { # end here
keep=0
}
if (keep==1) {
hemmingway_df<-rbind(hemmingway_df,as.data.frame(raw_text3[i,]))
}
if (str_detect(raw_text3[i,], "chapter 1")) { # start here
keep=1
}
}
colnames(hemmingway_df) <- "text"
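The row-by-row rbind loop above works but grows the data frame one row at a time. A vectorized sketch of the same extraction, assuming each marker string matches a line in the text:
# find the start and end markers, then keep only the rows between them
start <- which(str_detect(raw_text3[[1]], "chapter 1"))[1]
end <- which(str_detect(raw_text3[[1]], "Here ends _The Inquest_"))[1]
hemmingway_df <- data.frame(text = raw_text3[(start + 1):(end - 1), 1])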
Separate the words into rows.
hemmingway_words_df <- hemmingway_df %>%
  unnest_tokens(word, text)
Isolate the 20 most common verbs from both authors.
verb_count_eh<-hemmingway_words_df %>%
inner_join(verbs_df) %>%
count(word, sort = TRUE) %>%
ungroup() %>%
head(n=20)
verb_count_ja<-tidy_books %>%
inner_join(verbs_df) %>%
count(word, sort = TRUE) %>%
ungroup() %>%
head(n=20)
Display the verb counts side by side.
plot1<-verb_count_ja %>% ggplot(aes(y=word, x=n)) +
geom_col(color = "#112446", fill = "#ffffff") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90)) +
labs(title='Jane Austen')
plot2<-verb_count_eh %>% ggplot(aes(y=word, x=n)) +
geom_col(color = "#112446", fill = "#ffffff") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90)) +
labs(title='Ernest Hemingway')
grid.arrange(plot1, plot2, ncol = 2)
I would expect the themes of romance and war to be juxtaposed here. It's interesting to see think, hope, and wish on the left and words like kill, face, and forward on the right.
Extra question.
I've always admired Hemingway for his simple prose. I'd like to compare the length of his words to those of Jane Austen.
mean(nchar(hemmingway_words_df$word))
## [1] 4.098023
mean(nchar(tidy_books$word))
## [1] 4.344304
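The gap is about a quarter of a character per word. One way to check that it is more than noise would be a two-sample t-test; a sketch, not run in the original analysis:
# Welch two-sample t-test on word lengths (illustrative)
t.test(nchar(hemmingway_words_df$word), nchar(tidy_books$word))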
Notes: This is an interesting baseline for exploring how the language of literature evolves over time.
Having only one short book by Hemingway is not sufficient, but it was the only Hemingway book I could find.
miss and man show up as verbs; I'm pretty sure they are mostly not used as verbs in these texts.