Sentiment analysis


library(wordcloud)
library(reshape2)
library(janeaustenr)
library(tidytext)
library(lexicon)
library(tidyverse)   # dplyr, stringr, tidyr, ggplot2
library(gridExtra)   # grid.arrange
library(knitr)       # kable
library(kableExtra)  # kable_styling






Analysis of Jane Austen






The Jane Austen analysis is reproduced courtesy of O'Reilly's Text Mining with R [7].

Licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 United States License



Text Mining with R: Chapter 2



We will use three general-purpose sentiment lexicons. (In recent versions of tidytext, the AFINN and NRC lexicons are downloaded on first use via the textdata package.)


afinn_words<-get_sentiments("afinn")       # words scored from -5 to +5

bing_words<-get_sentiments("bing")         # words are simply negative or positive

nrc_words<-get_sentiments("nrc")           # words are categorized: anger, fear, etc.

nrc_words %>%
  group_by(sentiment) %>%
  summarize(n())
## # A tibble: 10 x 2
##    sentiment    `n()`
##    <chr>        <int>
##  1 anger         1246
##  2 anticipation   837
##  3 disgust       1056
##  4 fear          1474
##  5 joy            687
##  6 negative      3318
##  7 positive      2308
##  8 sadness       1187
##  9 surprise       532
## 10 trust         1230





The unnest_tokens function parses each word into a separate row so we can aggregate by specific words.


Because of group_by(book), the mutate runs per book, so every one of the ~73K lines in the books gets a line number and a chapter; ungroup() then removes that grouping so later steps see one flat table.
unnest_tokens parses each word in the text column into its own row in a new word column, turning ~73K records into ~725K.

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, 
                                regex("^chapter [\\divxlc]", 
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)





The tidy_books data frame now contains all six books, with a separate record for every word.


tidy_books %>%
  group_by(book) %>%
  summarize(n())
## # A tibble: 6 x 2
##   book                 `n()`
##   <fct>                <int>
## 1 Sense & Sensibility 119957
## 2 Pride & Prejudice   122204
## 3 Mansfield Park      160460
## 4 Emma                160996
## 5 Northanger Abbey     77780
## 6 Persuasion           83658





Populate the nrc_joy data frame with the NRC words associated with joy.


# get the "joy" words
nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")





Join tidy_books with nrc_joy to count the most common joy words in Emma.

# count the number of joy words in the book "Emma"
tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## # A tibble: 301 x 2
##    word          n
##    <chr>     <int>
##  1 good        359
##  2 friend      166
##  3 hope        143
##  4 happy       125
##  5 love        117
##  6 deal         92
##  7 found        92
##  8 present      89
##  9 kind         82
## 10 happiness    76
## # ... with 291 more rows





Join tidy_books with the Bing sentiments.
Create a new field called index that breaks each book into sections of 80 lines.
Use pivot_wider to spread the negative and positive counts into columns.


tidy_books looks like this:

book                 linenumber  chapter  word
Sense & Sensibility           1        0  sense
Sense & Sensibility           1        0  and
Sense & Sensibility           1        0  sensibility


count() adds an index column (linenumber %/% 80) that represents an arbitrary section of 80 lines and tallies sentiment words within each section:

book                 index  sentiment  n
Sense & Sensibility     73  negative   29
Sense & Sensibility     73  positive   21


pivot_wider turns that into:

book                 index  negative  positive
Sense & Sensibility     73        29        21
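
As a quick check of how the integer division assigns lines to sections (the line numbers here are illustrative):

5840 %/% 80   # 73: first line of section 73
5919 %/% 80   # 73: last line of section 73
5920 %/% 80   # 74: start of the next section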


jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
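
To confirm the result matches the sketch above, we can pull out the section-73 row for Sense & Sensibility:

jane_austen_sentiment %>%
  filter(book == "Sense & Sensibility", index == 73)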





Plot the net sentiment of each 80-line section, faceted by book.

ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")





Display a bar chart using geom_col, which plots values as given rather than counting rows; e.g. the "miss" record appears once in bing_word_counts with a value of 1855.


Brief review of bar charts (a toy sketch follows the list):
        geom_col() uses stat_identity(): it leaves the data as-is and plots pre-computed values for categorical data
        geom_bar() uses stat_count() and plots counts of rows per category
        histograms plot counts of continuous variables grouped into bins
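
A minimal sketch of the difference, using made-up counts rather than data from the books:

# toy data to contrast the two geoms
df <- data.frame(word = c("good", "miss"), n = c(359, 1855))
ggplot(df, aes(word, n)) + geom_col()   # bar heights taken from n as-is
ggplot(df, aes(word)) + geom_bar()      # bar heights are row counts (1 each)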

bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>% 
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)





Use bind_rows to add a custom word ("miss") to the stop_words dataset; anti_join then removes the stop words before counting.
Word clouds (also known as wordles, word collages, or tag clouds) are visual representations of words that give greater prominence to words that appear more frequently.

# stop_words is a dataset in the tidytext package:
# a data frame with 1,149 rows and 2 variables (word and lexicon)
custom_stop_words <- bind_rows(tibble(word = c("miss"),  
                                      lexicon = c("custom")), 
                               stop_words)


tidy_books %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
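
The cloud above uses the stock stop word list; the custom list built a moment ago (with "miss" added) drops in the same way:

tidy_books %>%
  anti_join(custom_stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))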





The acast function is part of the reshape2 package, a reboot of the original reshape package.


acast and dcast are related functions that cast long data into wide form: acast returns an array or matrix, dcast a data frame.
comparison.cloud is part of the wordcloud package and compares word frequencies between groups.
In this case the sentiment field is either positive or negative.
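
A toy cast with made-up counts shows the long-to-wide reshaping:

toy <- data.frame(word      = c("happy", "happy", "gloom"),
                  sentiment = c("positive", "negative", "negative"),
                  n         = c(10, 2, 5))
acast(toy, word ~ sentiment, value.var = "n", fill = 0)
##       negative positive
## gloom        5        0
## happy        2       10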

tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)





Moving on to Sentences.


Print out a couple of example sentences to show what unnest_tokens did.


#  prideprejudice is a dataset included in the janeaustenr package

p_and_p_sentences <- tibble(text = prideprejudice) %>% 
  unnest_tokens(sentence, text, token = "sentences")


for (i in 140:144) {
  print(p_and_p_sentences$sentence[i])
}
## [1] "\"i do not believe mrs."
## [1] "long will do any such thing."
## [1] "she has two nieces"
## [1] "of her own."
## [1] "she is a selfish, hypocritical woman, and i have no opinion"





Use unnest_tokens with a regex token to split the text into chapters.


note: austen_books() has just two columns: text and book

austen_chapters <- austen_books() %>%
  group_by(book) %>%
  unnest_tokens(chapter, text, token = "regex", 
                pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
  ungroup()

austen_chapters %>% 
  group_by(book) %>% 
  summarise(chapters = n())
## # A tibble: 6 x 2
##   book                chapters
##   <fct>                  <int>
## 1 Sense & Sensibility       51
## 2 Pride & Prejudice         62
## 3 Mansfield Park            49
## 4 Emma                      56
## 5 Northanger Abbey          32
## 6 Persuasion                25





Isolate the negative words and, for each book, display the chapter with the highest ratio of negative words to total words.


bingnegative <- get_sentiments("bing") %>% 
  filter(sentiment == "negative")

wordcounts <- tidy_books %>%
  group_by(book, chapter) %>%
  summarize(words = n())

tidy_books %>%
  semi_join(bingnegative) %>%
  group_by(book, chapter) %>%
  summarize(negativewords = n()) %>%
  left_join(wordcounts, by = c("book", "chapter")) %>%
  mutate(ratio = negativewords/words) %>%
  filter(chapter != 0) %>%
  slice_max(ratio, n = 1) %>% 
  ungroup()
## # A tibble: 6 x 5
##   book                chapter negativewords words  ratio
##   <fct>                 <int>         <int> <int>  <dbl>
## 1 Sense & Sensibility      43           161  3405 0.0473
## 2 Pride & Prejudice        34           111  2104 0.0528
## 3 Mansfield Park           46           173  3685 0.0469
## 4 Emma                     15           151  3340 0.0452
## 5 Northanger Abbey         21           149  2982 0.0500
## 6 Persuasion                4            62  1807 0.0343






Analysis of Ernest Hemingway


In this section we will download a book from Project Gutenberg.


We will look at Ernest Hemingway's first book, In Our Time.


It's not a perfect comparison; I chose Hemingway because I have some thoughts about his style.


Let's acknowledge the differences between our authors.

Author            Nationality  Born  Died  Sample Size
Jane Austen       English      1775  1817  6 books
Ernest Hemingway  American     1899  1961  1 book

We will use the lexicon package and run several comparisons between Austen and Hemingway.
Most of its datasets seem more appropriate for modern usage.

kable(lexicon::available_data() , caption="lexicon datasets",row.names = FALSE, booktabs=TRUE, table.attr = "style='width:80%;'") %>%
  kable_styling(font_size = 8)
lexicon datasets
Data Description
cliches Common Cliches
common_names First Names (U.S.)
constraining_loughran_mcdonald Loughran-McDonald Constraining Words
emojis_sentiment Emoji Sentiment Data
freq_first_names Frequent U.S. First Names
freq_last_names Frequent U.S. Last Names
function_words Function Words
grady_augmented Augmented List of Grady Ward’s English Words and Mark Kantrowitz’s Names List
hash_emojis Emoji Description Lookup Table
hash_emojis_identifier Emoji Identifier Lookup Table
hash_emoticons Emoticons
hash_grady_pos Grady Ward’s Moby Parts of Speech
hash_internet_slang List of Internet Slang and Corresponding Meanings
hash_lemmas Lemmatization List
hash_nrc_emotions NRC Emotion Table
hash_sentiment_emojis Emoji Sentiment Polarity Lookup Table
hash_sentiment_huliu Hu Liu Polarity Lookup Table
hash_sentiment_jockers Jockers Sentiment Polarity Table
hash_sentiment_jockers_rinker Combined Jockers & Rinker Polarity Lookup Table
hash_sentiment_loughran_mcdonald Loughran-McDonald Polarity Table
hash_sentiment_nrc NRC Sentiment Polarity Table
hash_sentiment_senticnet Augmented SenticNet Polarity Table
hash_sentiment_sentiword Augmented Sentiword Polarity Table
hash_sentiment_slangsd SlangSD Sentiment Polarity Table
hash_sentiment_socal_google SO-CAL Google Polarity Table
hash_valence_shifters Valence Shifters
key_contractions Contraction Conversions
key_corporate_social_responsibility Nadra Pencle and Irina Malaescu’s Corporate Social Responsibility Dictionary
key_grade Grades Data Set
key_rating Ratings Data Set
key_regressive_imagery Colin Martindale’s English Regressive Imagery Dictionary
key_sentiment_jockers Jockers Sentiment Data Set
modal_loughran_mcdonald Loughran-McDonald Modal List
nrc_emotions NRC Emotions
pos_action_verb Action Word List
pos_df_irregular_nouns Irregular Nouns Word Dataframe
pos_df_pronouns Pronouns
pos_interjections Interjections
pos_preposition Preposition Words
profanity_alvarez Alejandro U. Alvarez’s List of Profane Words
profanity_arr_bad Stackoverflow user2592414’s List of Profane Words
profanity_banned bannedwordlist.com’s List of Profane Words
profanity_racist Titus Wormer’s List of Racist Words
profanity_zac_anger Zac Anger’s List of Profane Words
sw_dolch Leveled Dolch List of 220 Common Words
sw_fry_100 Fry’s 100 Most Commonly Used English Words
sw_fry_1000 Fry’s 1000 Most Commonly Used English Words
sw_fry_200 Fry’s 200 Most Commonly Used English Words
sw_fry_25 Fry’s 25 Most Commonly Used English Words
sw_jockers Matthew Jocker’s Expanded Topic Modeling Stopword List
sw_loughran_mcdonald_long Loughran-McDonald Long Stopword List
sw_loughran_mcdonald_short Loughran-McDonald Short Stopword List
sw_lucene Lucene Stopword List
sw_mallet MALLET Stopword List
sw_python Python Stopword List


The action verb dataset is a good one for our purposes.

verbs_df<-data.frame(word=pos_action_verb)
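
A quick peek at what the join will match on (the exact contents and count depend on the lexicon version installed):

head(verbs_df$word)   # sample of the action verbs
nrow(verbs_df)        # size of the list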





Download the Data.


destfile<-"hemingway.txt"

url <- "https://www.gutenberg.org/files/61085/61085-0.txt"

raw_text <-read.fwf(url, width=1000 )





Tidy the Data.


The Hemingway file has a lot of non-ASCII characters, blank lines, and extra verbiage.
We will use gsub to remove anything outside the printable ASCII range,
which starts at hexadecimal 20 (space) and ends at hexadecimal 7E (tilde)
and includes all alphanumeric characters and punctuation marks.
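
Before applying it to the book, a quick sanity check of the pattern on a toy string containing a curly quote and an em dash:

gsub("[^\x20-\x7E]", "", "It\u2019s fine \u2014 mostly.")
## [1] "Its fine  mostly."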

raw_text2 <- na.omit(raw_text)        # drop NA rows
colnames(raw_text2) <- "V1"
raw_text3 <- data.frame(gsub("[^\x20-\x7E]", "", raw_text2$V1))  # strip non-ASCII



# accumulator for the body text of the book
hemmingway_df <- data.frame(
  text     =character()
)


keep=0   # 1 while we are inside the body of the book



for(i in 1:nrow(raw_text3)) {     

  # stop keeping lines once the closing marker is reached
  if (str_detect(raw_text3[i,], "Here ends _The Inquest_")) {  # end here
    keep=0
  }
  
  # keep the current line if we are inside the body
  if (keep==1) {
    hemmingway_df<-rbind(hemmingway_df,as.data.frame(raw_text3[i,]))
  }
  
  # start keeping lines after the opening marker
  # (checked last, so the marker line itself is not kept)
  if (str_detect(raw_text3[i,], "chapter 1")) {   # start here
    keep=1
  }

}


colnames(hemmingway_df)<-"text"





Separate the words into rows.


hemmingway_words_df <- hemmingway_df %>%
  unnest_tokens(word, text)





Isolate the 20 most common verbs from both authors.


verb_count_eh<-hemmingway_words_df %>%
  inner_join(verbs_df) %>%
  count(word, sort = TRUE) %>%
  ungroup() %>%
  head(n=20)


verb_count_ja<-tidy_books %>%
  inner_join(verbs_df) %>%
  count(word, sort = TRUE) %>%
  ungroup() %>%
  head(n=20)





Display the verb counts side by side.


plot1<-verb_count_ja %>% ggplot(aes(y=word, x=n)) +
  geom_col(color = "#112446", fill="#ffffff") +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 90)) + 
  labs(title='Jane Austen')



plot2<-verb_count_eh %>% ggplot(aes(y=word, x=n)) +
  geom_col(color = "#112446", fill="#ffffff") +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 90)) + 
  labs(title='Ernest Hemingway')

grid.arrange(plot1, plot2, ncol = 2)

I would expect the themes of romance and war to be juxtaposed here. It's interesting to see "think", "hope", and "wish" on the left and words like "kill", "face", and "forward" on the right.





Extra question.


I've always admired Hemingway for his simple prose. I'd like to compare the length of his words to those of Jane Austen.

mean(nchar(hemmingway_words_df$word))
## [1] 4.098023
mean(nchar(tidy_books$word))
## [1] 4.344304
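
The gap is about a quarter of a character. As a rough check that it isn't just noise, a two-sample t-test can be run on the word lengths (word occurrences aren't independent draws, so treat the result as indicative only):

t.test(nchar(hemmingway_words_df$word), nchar(tidy_books$word))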





Notes: This is an interesting baseline for exploring how the language of literature evolves over time.
Having only one short book by Hemingway is not a sufficient sample, but it was the only Hemingway book I could find on Project Gutenberg.
"miss" and "man" show up as verbs; I'm pretty sure they are mostly not used as verbs in these texts.






References


[7] Julia Silge and David Robinson. Text Mining with R: A Tidy Approach. O'Reilly Media, 2017. https://www.tidytextmining.com/