#download the Rling file from online
install.packages(file.choose(), repos = NULL, type = "source")
#deal with modeest
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("genefilter")
#download from CRAN
install.packages(c("knitr", "modeest", "car", "rms", "visreg",
"googleVis", "party", "pvclust", "LSAfun",
"ngram", "tm", "slam", "tidytext", "topicmodels",
"tidyverse", "fields", "rgl", "rworldmap", "psych",
"ca", "FactoMineR", "janeaustenr", "dplyr", "stringr",
"tidyr", "ggplot2", "wordcloud", "reshape2"))
Generally, lectures will be formatted with:
Data for this course will come in many forms, as language is inherently unstructured. We will mostly use tidy format as defined by the tidyverse:
- Each variable is a column.
- Each observation is a row.
- Each type of observational unit is a table.
However, we might define an observation as a frequency or proportion of observations or a specific word (rather than a participant or person in a study), etc. Learning how to structure the data for our analyses will be part of the goal of each lecture.
Tokens - a meaningful unit of text, often a word, but could be a phrase, document, sentence, etc. Thus, to keep our data in tidy format, we use one token per row, treating each token as an observation. Later in the semester, we will use term-by-document matrices and corpus objects, which are formatted differently.
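As a minimal sketch of the one-token-per-row idea (assuming the packages above are installed; the two-line "document" here is made up for illustration):
#load the packages and build a tiny two-line text
library(tidytext)
library(dplyr)
toy <- tibble(line = 1:2,
              text = c("We use our understanding of words",
                       "to help determine meaning"))
#unnest_tokens splits each line into one lowercase word per row
toy %>%
  unnest_tokens(word, text)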
The tidytext package has many tools that we can use to help us analyze text information. Let’s load it and try sentiment analysis with the package.
When we read, we use our understanding of words to help determine meaning. Often the semanticity of words includes their emotional intent. Using information about the valence of words, we can determine if a text is positive or negative (or other emotional descriptors).
Run the code below to see the graphic. Make sure you’ve downloaded the picture and put it in the same folder as this assignment.
The graphic below shows how you might treat a research workflow using tidyverse to analyze sentiment.
We are going to examine sentiment as a "sum of parts" - this approach means that we can sum up the sentiments of individual words to represent the larger text.
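As a quick sketch of this sum-of-parts idea (using the packages loaded above; the mini-lexicon and its valences are made up for illustration):
#a hypothetical mini-lexicon of word valences
mini_lexicon <- tibble(word = c("happy", "good", "sad", "terrible"),
                       value = c(2, 1, -2, -3))
#score a text by summing the valence of every matched word
tibble(text = "a good day turned sad and terrible") %>%
  unnest_tokens(word, text) %>%
  inner_join(mini_lexicon, by = "word") %>%
  summarise(sentiment = sum(value)) #1 - 2 - 3 = -4, net negative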
The sentiments dataset as part of tidytext includes three sentiment lexicons: AFINN, bing, and nrc. There are several others we can use including one by Warriner et al., but these provide good coverage of common English words.
The dataset includes each word, the sentiment it is tagged with, and the lexicon it came from. The lexicons differ in how they code sentiment:
- AFINN assigns each word a score from -5 (most negative) to 5 (most positive).
- bing codes words in a binary fashion as positive or negative.
- nrc codes words as positive or negative and into emotion categories (anger, anticipation, disgust, fear, joy, sadness, surprise, and trust).
Use get_sentiments as a quick subset function to grab only one lexicon, depending on what you are interested in. The limitation to these datasets is that we have to remember when and how they were validated. One thing we will discuss this semester is the fact that word meanings change over time, so we have to consider the time period for each analysis.
Another limitation to this approach is that context is ignored (sometimes this approach is considered “bag-of-words” because words are just tossed into a bag and totalled up). Qualifiers like “no” and “aren’t” are not considered - additionally, sarcasm and idioms will not be captured.
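One quick way to see this bag-of-words limitation in code (a sketch, assuming the lexicons download without issue):
#"not happy" still scores as positive because the "not" is simply dropped
tibble(text = "I am not happy") %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word")
#only "happy" matches the bing lexicon, and it is tagged positive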
For this analysis, we are going to explore Jane Austen novels. You will want to change the parameters of the analysis while exploring the functionality of the code. You should fill in the information where requested - look for instructions in ALL CAPS.
#load the libraries (tidytext provides unnest_tokens and the sentiment lexicons)
library(tidytext)
library(janeaustenr)
library(dplyr)
library(stringr)
#specific to this package, pull the jane austen books and create a tidy dataframe
tidy_books <- austen_books() %>%
group_by(book) %>%
mutate(linenumber = row_number(),
chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
ignore_case = TRUE)))) %>%
ungroup() %>%
unnest_tokens(word, text)
Specifically, check out the unnest_tokens function here - word is the output column, while text is the input column.
WHAT DOES IT APPEAR THAT THE unnest_tokens FUNCTION DID? TRY RUNNING THE CODE WITH AND WITHOUT THE LAST LINE.
Let's first see what happens prior to the unnest_tokens function:
#assign to a separate name so we don't overwrite the word-level tidy_books used later
tidy_lines <- austen_books() %>%
group_by(book) %>%
mutate(linenumber = row_number(),
chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
ignore_case = TRUE)))) %>%
ungroup()
tidy_lines
So the code prior to the unnest_tokens function produces a dataset where each line in each book is a record, with information about the book, line number, and chapter number. The unnest_tokens function then 'unnests' the words in each line so that each word in a book becomes a record, carrying the book name, line number, and chapter number for that word, as shown below:
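A quick peek at that word-level result (the original output is not reproduced here):
head(tidy_books)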
The code provided analyzes the “joy” sentiment in “Emma”.
nrc_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
tidy_books %>%
filter(book == "Emma") %>%
inner_join(nrc_joy) %>% #this function merges sentiment with the Emma data
count(word, sort = TRUE) #makes a frequency table
EDIT THE CODE TO USE A DIFFERENT EMOTION AND NOVEL.
Let’s try to analyze the ‘fear’ sentiment in ‘Sense & Sensibility’.
nrc_fear <- get_sentiments("nrc") %>%
filter(sentiment == "fear")
tidy_books %>%
filter(book == "Sense & Sensibility") %>%
inner_join(nrc_fear) %>%
count(word, sort = TRUE)
WHAT ARE THE TOP WORDS IN YOUR EMOTION AND NOVEL?
The top five words representing fear in 'Sense & Sensibility' are below:
DO THERE APPEAR TO BE SOME WORDS THAT ARE SURPRISING TO YOU? (I.E. THEY DO NOT SEEM TO MATCH WHAT YOU MIGHT EXPECT TO FIND AS FREQUENT FOR THAT EMOTION)
Not all of the above words seem to truly represent fear. Doubt represents fear more often than not, and illness might cause fear. However, surprise does not always represent fear, as in the case of a pleasant surprise. Confidence is almost the opposite of fear, and case doesn't really represent fear; it depends on the scenario.
We should consider the size of the text chunks we analyze for sentiment. If we use a whole document, the effects of local sections of sentiment (like one sad chapter) may get washed out. However, you may not want to use single sentences because you might miss the larger structure of the text. The suggestion from the book is to use ~80 lines of text, and she is pretty smart, so let's try that.
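The 80-line chunking below relies on integer division (%/%), which maps each line number into a bucket; a tiny sketch:
#integer division buckets line numbers into 80-line chunks
linenumber <- c(1, 79, 80, 159, 160, 400)
linenumber %/% 80
## [1] 0 0 1 1 2 5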
library(tidyr)
jane_austen_sentiment <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(book, index = linenumber %/% 80, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
jane_austen_sentiment
WHAT DID THIS CODE APPEAR TO CREATE FOR US?
The 'bing' lexicon from the 'sentiments' dataset contains only two sentiment labels: positive and negative. The code above first inner joins each word in the 'tidy_books' dataset to the 'bing' portion of the 'sentiments' dataset, flagging each matched word as positive or negative. It then takes 80 lines of text at a time from each book and counts the number of words carrying positive and negative sentiment, respectively. Next, it spreads the sentiment column into two separate columns, positive and negative. Finally, it calculates the sentiment for each book and index (each 80-line chunk), defined here as the number of positive words minus the number of negative words in that chunk.
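As a side note, spread() has since been superseded in tidyr; on newer versions the same reshaping can be written with pivot_wider() (a sketch, assuming tidyr >= 1.1; jane_austen_sentiment2 is just a scratch name):
jane_austen_sentiment2 <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% #replaces spread()
  mutate(sentiment = positive - negative)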
We can use ggplot2 to plot the sentiment across the predefined chunks of text. This plot is similar to a lexical dispersion plot, which shows the instances of a word across a text.
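As an aside, here is a rough sketch of a lexical dispersion plot itself (the word "hope" is just an illustrative choice):
library(ggplot2)
#mark every line on which "hope" occurs, per novel
tidy_books %>%
  filter(word == "hope") %>%
  ggplot(aes(linenumber, book)) +
  geom_point(shape = 124, alpha = 0.3) + #shape 124 draws a vertical tick
  theme_bw()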
library(ggplot2)
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, ncol = 2, scales = "free_x") +
theme_bw()
EXAMINE THE GRAPH - WHAT BOOK APPEARS TO HAVE THE MOST POSITIVE INTERPRETATION? THE MOST NEGATIVE?
From the graphs, it seems that 'Persuasion' has the most positive interpretation and 'Northanger Abbey' the most negative. But it is difficult to confirm this from the graph alone because the novels vary in length. Let's try to confirm it with a non-visual analysis.
There are two ways to interpret this:
1. Unweighted: classify each chunk as simply net positive or net negative, then compare the proportion of positive chunks in each book.
2. Weighted: weight each chunk by the magnitude of its sentiment score, so strongly positive or negative chunks count for more.
Using the first interpretation, the unweighted sentiments for each book are shown below:
jane_austen_sentiment_VDE <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(book, index = linenumber %/% 80, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative) %>%
mutate(marker = ifelse(sentiment > 0, "pos", "neg"), total_words = positive + negative, sentiment_abs = abs(sentiment))
mytable <- table(jane_austen_sentiment_VDE$book, jane_austen_sentiment_VDE$marker)
unweighted_sentiments <- prop.table(mytable,1)
unweighted_sentiments
##
## neg pos
## Sense & Sensibility 0.2784810 0.7215190
## Pride & Prejudice 0.2760736 0.7239264
## Mansfield Park 0.2343750 0.7656250
## Emma 0.1970443 0.8029557
## Northanger Abbey 0.2929293 0.7070707
## Persuasion 0.1428571 0.8571429
Based on the first interpretation, which uses unweighted sentiments, we can confirm that 'Persuasion' has the most positive interpretation and 'Northanger Abbey' the most negative.
Now let’s check for the weighted sentiments based on the second interpretation.
weighted_sentiments <- jane_austen_sentiment_VDE %>%
group_by(book) %>%
summarize (pos = sum(sentiment)/sum(sentiment_abs)) %>%
mutate (neg = 1-pos) %>%
select (book, neg, pos)
data.frame(weighted_sentiments)
We see that even under the second interpretation, which uses weighted sentiments, 'Persuasion' has the most positive interpretation and 'Northanger Abbey' the most negative.
CHANGE THE NUMBER OF LINES TO SOMETHING SMALLER LIKE 10-20 OR MUCH LARGER LIKE 200 - RERUN THE CODE AND GRAPH. WHAT CHANGES DO YOU SEE?
library(tidyr)
jane_austen_sentiment <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(book, index = linenumber %/% 10, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, ncol = 2, scales = "free_x") +
theme_bw()
It is extremely difficult to make much of an interpretation about the overall book based on this.
Let's look at the unweighted and weighted sentiments:
jane_austen_sentiment_VDE <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(book, index = linenumber %/% 10, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative) %>%
mutate(marker = ifelse(sentiment > 0, "pos", "neg"), total_words = positive + negative, sentiment_abs = abs(sentiment))
mytable <- table(jane_austen_sentiment_VDE$book, jane_austen_sentiment_VDE$marker)
unweighted_sentiments <- prop.table(mytable,1)
unweighted_sentiments
##
## neg pos
## Sense & Sensibility 0.4464856 0.5535144
## Pride & Prejudice 0.4189294 0.5810706
## Mansfield Park 0.4153041 0.5846959
## Emma 0.4006192 0.5993808
## Northanger Abbey 0.4461538 0.5538462
## Persuasion 0.3808948 0.6191052
Looking at unweighted sentiments with a 10-line index, 'Persuasion' still has the most positive interpretation, albeit to a much lesser degree. However, on this basis 'Sense & Sensibility' has the most negative interpretation.
weighted_sentiments <- jane_austen_sentiment_VDE %>%
group_by(book) %>%
summarize (pos = sum(sentiment)/sum(sentiment_abs)) %>%
mutate (neg = 1-pos) %>%
select (book, neg, pos)
data.frame(weighted_sentiments)
Based on the weighted sentiments, 'Persuasion' still has the most positive interpretation. However, with a 10-line index, 'Northanger Abbey' has the most negative interpretation.
library(tidyr)
jane_austen_sentiment <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(book, index = linenumber %/% 200, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, ncol = 2, scales = "free_x") +
theme_bw()
This is much clearer. 'Persuasion' still seems to have the most positive interpretation. However, 'Northanger Abbey' does not look as negative as before.
Let's look at the unweighted and weighted sentiments:
jane_austen_sentiment_VDE <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(book, index = linenumber %/% 200, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative) %>%
mutate(marker = ifelse(sentiment > 0, "pos", "neg"), total_words = positive + negative, sentiment_abs = abs(sentiment))
mytable <- table(jane_austen_sentiment_VDE$book, jane_austen_sentiment_VDE$marker)
unweighted_sentiments <- prop.table(mytable,1)
unweighted_sentiments
##
## neg pos
## Sense & Sensibility 0.21875000 0.78125000
## Pride & Prejudice 0.21212121 0.78787879
## Mansfield Park 0.15584416 0.84415584
## Emma 0.12195122 0.87804878
## Northanger Abbey 0.22500000 0.77500000
## Persuasion 0.07142857 0.92857143
As expected, 'Persuasion' has the most positive interpretation and 'Northanger Abbey' the most negative, even though the graphs were hard to read because the novels differ in length.
weighted_sentiments <- jane_austen_sentiment_VDE %>%
group_by(book) %>%
summarize (pos = sum(sentiment)/sum(sentiment_abs)) %>%
mutate (neg = 1-pos) %>%
select (book, neg, pos)
data.frame(weighted_sentiments)
Based on the weighted sentiments, 'Persuasion' still has the most positive interpretation. However, with a 200-line index, 'Sense & Sensibility' has the most negative interpretation.
In conclusion, picking too few lines per index disperses the overarching sentiment of a scene across multiple indexes, while picking too many lines per index loses that sentiment among the multiple scenes packed into one large index. The ideal number of lines per index depends on the book. In the case of Jane Austen's novels, based on the analysis so far, we can safely say that 'Persuasion' has the most positive sentiment and 'Northanger Abbey' the most negative.
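If you want to compare several chunk sizes without copying code, a sketch like this loops over them in one pass (assuming the chunks above have run; 10, 80, and 200 mirror the sizes tried here):
library(purrr)
#for each chunk size, compute the proportion of net-positive chunks per book
map_dfr(c(10, 80, 200), function(chunk_size) {
  tidy_books %>%
    inner_join(get_sentiments("bing"), by = "word") %>%
    count(book, index = linenumber %/% chunk_size, sentiment) %>%
    spread(sentiment, n, fill = 0) %>%
    mutate(sentiment = positive - negative) %>%
    group_by(book) %>%
    summarise(chunk_size = chunk_size,
              prop_positive = mean(sentiment > 0))
})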
The choice in lexicon might be based on word overlap (i.e. it has the words you need) or based on what you want to analyze. Because we have more than one, we can compare them directly.
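One quick way to check word overlap is to count how many distinct words from the novels each lexicon covers (a sketch; semi_join keeps matching rows without duplicating them):
#how many distinct words appear in the novels, and how many does each lexicon cover?
vocab <- tidy_books %>% distinct(word)
nrow(vocab)
nrow(semi_join(vocab, get_sentiments("bing"), by = "word"))
nrow(semi_join(vocab, get_sentiments("afinn"), by = "word"))
nrow(semi_join(vocab, get_sentiments("nrc"), by = "word"))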
This code filters to one book, pulls each sentiment lexicon, and merges it with that book's words. The plot at the end compares each of the methods. Here it is most appropriate to use the positive and negative categories from NRC to match the bing and AFINN datasets; otherwise we might not be comparing the same ideas.
#subset to the book chosen for this comparison
pride_prejudice <- tidy_books %>%
filter(book == "Pride & Prejudice")
afinn <- pride_prejudice %>%
inner_join(get_sentiments("afinn")) %>%
group_by(index = linenumber %/% 80) %>%
summarise(sentiment = sum(score)) %>% #newer tidytext releases name the AFINN column "value"
mutate(method = "AFINN")
bing_and_nrc <- bind_rows(pride_prejudice %>%
inner_join(get_sentiments("bing")) %>%
mutate(method = "Bing et al."),
pride_prejudice %>%
inner_join(get_sentiments("nrc") %>%
filter(sentiment %in% c("positive",
"negative"))) %>%
mutate(method = "NRC")) %>%
count(method, index = linenumber %/% 80, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
bind_rows(afinn,
bing_and_nrc) %>%
ggplot(aes(index, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y")
EXAMINING YOUR BOOK - DO THE THREE SOURCES APPEAR TO AGREE? WHAT ARE THE MAJOR DIFFERENCES OR SIMILARITIES?
Comparing the three sources below:
THIS EXAMPLE IS FOR PRIDE & PREJUDICE. CHANGE THE CODE HERE TO USE A DIFFERENT BOOK.
Let’s try this for the book, ‘Northanger Abbey’.
northangerabbey <- tidy_books %>%
filter(book == "Northanger Abbey")
afinn <- northangerabbey %>%
inner_join(get_sentiments("afinn")) %>%
group_by(index = linenumber %/% 80) %>%
summarise(sentiment = sum(score)) %>%
mutate(method = "AFINN")
bing_and_nrc <- bind_rows(northangerabbey %>%
inner_join(get_sentiments("bing")) %>%
mutate(method = "Bing et al."),
northangerabbey %>%
inner_join(get_sentiments("nrc") %>%
filter(sentiment %in% c("positive",
"negative"))) %>%
mutate(method = "NRC")) %>%
count(method, index = linenumber %/% 80, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
bind_rows(afinn,
bing_and_nrc) %>%
ggplot(aes(index, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y")
Comparing the lexicons, we see similar results as before: NRC classifies almost every index as positive; AFINN picks out negative indices better than NRC but gives positive indices large magnitudes because of the scores attached to them; and Bing looks the most balanced.
Let’s figure out the most common positive and negative words across all of Jane Austen’s texts.
bing_word_counts <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
head(bing_word_counts)
LOOK AT THE TOP WORDS HERE. WHY MIGHT A FEW OF THESE BE PROBLEMATIC/MISINTERPRETED? THINK ABOUT THE STYLE OF WRITING FOR THESE NOVELS.
The top words and why they might be problematic are below. Most notably, 'miss' is counted as negative, but in these novels it is overwhelmingly used as a title for a young unmarried woman (e.g., Miss Bennet), so it inflates the negative counts. All other words can be positive, negative, or neither depending on the context.
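One workaround is to add 'miss' to a custom stop word list before scoring (a sketch; the custom list is our own addition, not part of any lexicon):
#treat "miss" as a stop word so it no longer inflates the negative counts
custom_stop_words <- bind_rows(tibble(word = "miss", lexicon = "custom"),
                               stop_words)
tidy_books %>%
  anti_join(custom_stop_words, by = "word") %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  head()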
Let’s plot that analysis for easier viewing:
bing_word_counts %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
coord_flip() +
theme_bw()
CHANGE THE CODE ABOVE TO REFLECT ONLY ONE OF THE NOVELS IN THE DATASET. WHAT DO YOU OBSERVE ABOUT THE MOST USED POSITIVE AND NEGATIVE WORDS?
Let's do this for 'Northanger Abbey'.
bing_word_counts <- northangerabbey %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
head(bing_word_counts)
bing_word_counts %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
coord_flip() +
theme_bw()
Looking at the top words, 'miss' is again the top negative word and also the most frequently occurring word overall. Considering the context in which it is used, it might not imply negativity, so 'Northanger Abbey' might not be as negative as we initially interpreted. Another interesting observation is that the most frequently occurring positive and negative words in this book are more or less the same as across all her books, implying that her style of writing, with respect to the sentiments it conveys, does not vary much.
Word clouds are a popular visualization tool for text analysis. We can use the wordcloud library to create those plots. This analysis ignores stop words - common, high-frequency words like "the", "an", "of", and "into".
library(wordcloud)
tidy_books %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
It might be more interesting though to compare positive versus negative in the same plot:
library(reshape2)
tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("gray20", "gray80"),
max.words = 50)
EDIT THE ABOVE CODE TO ONLY INCLUDE ONE OF THE BOOKS. WHAT DO YOU FIND TO BE THE MOST POSITIVE AND NEGATIVE WORDS IN YOUR BOOK (SHOULD MATCH ABOVE).
Let’s do this for the book, ‘Northanger Abbey’.
northangerabbey %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
northangerabbey %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("gray20", "gray80"),
max.words = 50)
The word cloud of overall word frequencies differs between all the books combined and 'Northanger Abbey' alone, largely because of how often characters' names occur. The positive and negative sentiment word cloud, however, shows similar words at similar sizes, which suggests that her style of writing, at least with respect to sentiment, does not vary much across her books. The most commonly occurring negative words in 'Northanger Abbey', such as 'miss', 'scarcely', 'poor', 'doubt', and 'sorry', and the most commonly occurring positive words, such as 'well', 'good', 'great', 'like', and 'enough', are among the most common negative and positive words across all her novels.
You now have the skills to explore a set of text for positive and negative sentiment! You can apply these ideas to many types of text. In a future session, we will explore Twitter word usage and sentiment.
To turn in this assignment, hit KNIT at the top. You will submit the report in html/pdf/word format (default is html) on Moodle for credit. Be sure you have answered the questions. Great job!