STA 279 Lab 4
Complete all Questions.
The Goal
We have been learning about the foundations of sentiment analysis. The goal for today is to apply what we have learned, as well as to dig a little deeper into what we can do with sentiment analysis.
The Data Set
As sentiment analysis is designed to measure emotion, one common application is analyzing books, movies, or other media text. Today, we are going to work with a collection of books written by Jane Austen.
To load the data, use the following:
# Load the libraries
library(janeaustenr)
library(dplyr)
library(stringr)
library(ggplot2)
library(tidyr)
library(tidytext)
library(forcats)
books <- austen_books()

books <- books |>
  group_by(book) |>
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text,
                                regex("^chapter [\\divxlc]",
                                      ignore_case = TRUE))))
The data set books has \(n=73422\) rows. The first column, text, contains one line of text from a book, and the second column, book, tells us which of Jane Austen’s books the text came from. The code above also adds two more columns: linenumber and chapter.
Question 1
How many unique Jane Austen books are included in this data set?
Question 2
What text is present on Line 9 of this data set?
This is the first time we’ve worked with text that is longer than a few lines! The words in the books were input line by line for each book. Looking at the data, we can see that this means blank lines were included, as well as things like chapter numbers.
Question 3
Create tidy_books by tokenizing books. Do not remove stop words or punctuation. How many rows are in the resultant data set tidy_books?
Pride & Prejudice: Bing
Let’s start our work today by focusing on only one book: Pride & Prejudice.
Question 4
Create a subset of tidy_books called tidy_PandP which contains only the words from Pride & Prejudice. How many rows (and therefore how many words in total) are in tidy_PandP?
When we perform sentiment analysis, we generally look at the words in the text and determine what emotion each word conveys. Is it a positive word? A negative word? A word about surprise?
To determine what emotion a word expresses, we use lexicons. There are many lexicons to choose from, but for today, let’s start by exploring the Bing lexicon.
The Bing lexicon is a list of words, each tagged as either “positive” or “negative”. We can see this if we run the code below:
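A minimal sketch of what this chunk might look like, assuming the lexicon is stored under the name binglexicon referenced in the next paragraph:

# Load the Bing lexicon from tidytext
binglexicon <- get_sentiments("bing")

# Print it out
binglexicon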
If you look at binglexicon, you will notice it has two columns. The first column holds the words in the Bing lexicon. The second column, sentiment, states which sentiment is associated with each word. There are only two options for sentiment in this lexicon: positive or negative.
If we only want to find the words that are positive, we therefore use filter.
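A minimal sketch of what this chunk might look like, assuming the lexicon is stored as binglexicon and the result is saved as bing_positive, the name used with inner_join later on:

# Keep only the rows of the lexicon tagged as positive
bing_positive <- binglexicon |>
  filter(sentiment == "positive")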
Remember, filter allows us to choose rows in a data set that have a specific property! In this case, we choose only rows with positive in the sentiment column.
Question 5
How many negative words are in the Bing lexicon?
Okay, great. So far, we have a list of positive words and a list of negative words. What can we do with this?
One thing we can do is count how many words appearing on the negative list and how many words appearing on the positive list are in our text of interest, which is Pride & Prejudice. Let’s start by looking at the positive words in tidy_PandP. We want to count how many positive words from the Bing lexicon occur in the book. This means that a first step is to ignore any words in Pride & Prejudice that are not in our positive Bing lexicon. We can do this with the following code:
# Start with tokenized data
tidy_PandPpositive <- tidy_PandP |>
  # Keep only the positive words
  inner_join(bing_positive)
Question 6
What percent of all the words in Pride & Prejudice are positive words, according to the Bing lexicon?
You will notice that when we run commands like inner_join or other merge commands, we get a message output on the screen:

Joining with `by = join_by(word)`
If you want to hide this output, you can change your chunk header to be {r, message = FALSE, warning = FALSE}. If you’d like to make this change for all chunks, let Dr. Dalzell know! She can show you how to do this all at once so you don’t have to do it for each chunk individually. As a hint, this will be very helpful for Data Analysis 1!
Question 7
What percent of all the words in Pride & Prejudice are negative words, according to the Bing lexicon?
Once we have determined which words in the book are positive, we can count how many times each positive word appears. This allows us to find the most frequently occurring positive words in the book.
Question 8
Consider these two count commands:
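A minimal sketch of what these two chunks presumably look like, using the tidy_PandPpositive data set created above:

# Command 1
tidy_PandPpositive |>
  count()

# Command 2
tidy_PandPpositive |>
  count(word)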
The only difference is whether or not we have word inside the parentheses.
Our goal is to count how many times each positive word appears in the book. Which of these two options (Command 1 or Command 2) do we want?
Run the line of code you chose above and state which positive word occurs most often in the book. Hint: Remember that if you want the most frequently occurring word to appear at the top of your result, you need to add sort = TRUE to your count command. This means either count(sort = TRUE) or count(word, sort = TRUE), depending on which option you chose.
Question 9
Using the Bing lexicon, make a formatted table of the top 15 positive words in Pride & Prejudice.
Question 10
Using the Bing lexicon, make a formatted table of the top 15 negative words in Pride & Prejudice.
Considering “miss”
Here in the negative words, we notice something. The word “miss” is listed. While to “miss” an event or to “miss” someone is indeed negative, in the case of a Jane Austen novel “miss” would be used repeatedly in front of the names of characters: Miss Bennet, Miss Elizabeth, etc. This means that “miss” is essentially a stop word for Jane Austen novels!
This sort of thing actually happens all the time in text analysis. Words are not static; their uses and meanings change over time and with different applications. This means we often have to think critically about how to adapt techniques based on the context of the text we are working with.
There are a few options for dealing with words that should be stop words in a certain analysis. The first is to add the word to the list of stop words and then use anti_join as usual to remove the standard stop words and the custom stop word:
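A minimal sketch of this approach, assuming the combined list is called custom_stop_words (a name chosen here for illustration):

# Add "miss" to the standard stop word list
custom_stop_words <- stop_words |>
  bind_rows(tibble(word = "miss", lexicon = "custom"))

# Remove the standard stop words and the custom stop word
tidy_PandP |>
  anti_join(custom_stop_words)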
The second method is to change the sentiment of a specific word, for instance by labeling “miss” as neutral in a lexicon.
# Store the lexicon
bing_custom <- get_sentiments("bing")

# Assign "miss" a neutral sentiment
bing_custom <- bing_custom |>
  mutate(sentiment = ifelse(word == "miss", "neutral", sentiment))

# Print it out
bing_custom |>
  filter(word == "miss")
This means that when we filter to positive or negative sentiments only, this word will be excluded.
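For example, a quick check using the bing_custom lexicon defined above shows that “miss” is no longer returned when we ask for the negative words:

# "miss" no longer appears among the negative words
bing_custom |>
  filter(sentiment == "negative", word == "miss")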
The final way, and the one we will use for now, is just to remove the word from tidy_PandP.
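A minimal sketch of this step, treating “miss” as a one-word custom stop word list (this mirrors the anti_join used in the re-tokenizing code later in the lab):

# Drop the word "miss" from the tokenized Pride & Prejudice data
tidy_PandP <- tidy_PandP |>
  anti_join(tibble(word = "miss"))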
Question 11
Using the Bing lexicon after removing “miss”, make a formatted table of the top 15 negative words in Pride & Prejudice.
Comparing
Now that we have looked at the negative counts and positive counts separately, let’s compare them!
Question 12
With the word “miss” removed, make a bar chart to show the top 15 negative words and top 15 positive words in Pride and Prejudice.
Overall, are there more positive words or negative words in Pride and Prejudice?
Okay, great! We can determine common words associated with different sentiments and how often these words occur in a text. What else can we do with sentiment analysis?
Changes in Sentiment Across the Book
One cool thing we can do with sentiment analysis is determine how sentiment changes throughout the book. Books have sad scenes, happy scenes, etc. We can track this using sentiment analysis.
Let’s start by counting the number of positive and negative words in each chapter of the book. Before we do this, take a look at tidy_PandP. Does it contain the variable chapter? If so, you can skip the next step. If not, this means that we need to tokenize our data again:
tidy_PandP <- books |>
  # Choose only P and P
  filter(book == "Pride & Prejudice") |>
  # Group by chapter
  group_by(chapter) |>
  # Break into words
  unnest_tokens(word, text) |>
  # Remove the stop word "miss"
  anti_join(tibble(word = c("miss")))
With chapter now available to use, we can count the positive and negative words in each chapter.
sentiment_bychapter <- tidy_PandP |>
  inner_join(get_sentiments("bing")) |>
  group_by(chapter) |>
  count(sentiment)
Here, you will notice that we have grouped by chapter, because we want to count the number of positive and negative words in each chapter.
Question 13
How many positive words are in Chapter 6? How many negative words are in Chapter 6?
At this point, it would be cool if we could make a plot to see how the number of positive words and negative words differs by chapter. However, this would require sentiment_bychapter to have each row be a chapter, with one column for the number of positive words and another for the number of negative words. Our data doesn’t look like that right now…but it can.
To get the data in the format that we want, we can use the pivot_wider command in R. The new columns we want to create are named in the sentiment column of sentiment_bychapter, so we want the options in this current column to form two new columns: positive and negative. We want to fill in the values with the corresponding value of n (the number of words). To achieve this, we use:
sentiment_bychapter <- tidy_PandP |>
  # Keep only the words that appear in the bing lexicon
  inner_join(get_sentiments("bing")) |>
  # Count how many positive and negative
  # words are in each chapter
  count(chapter, sentiment) |>
  # Reformat the columns
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0)
We can now create a plot to see how the sentiment changes throughout the book! Let’s see how the number of positive words changes throughout the chapters.
Question 14
Create a professionally formatted bar plot showing the number of positive words in each chapter. The x axis should be the chapter number and the y axis should be the number of positive words.
Question 15
Create a professionally formatted bar plot showing the number of negative words in each chapter. The x axis should be the chapter number and the y axis should be the number of negative words.
Another way of comparing how the number of positive and negative words progress across the book is to create a sentiment score. For instance, we could subtract the number of negative words from the number of positive words.
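A minimal sketch of this calculation, assuming the wide-format sentiment_bychapter created above and naming the new column sentimentscore to match the hint below:

# Sentiment score = number of positive words minus number of negative words
sentiment_bychapter <- sentiment_bychapter |>
  mutate(sentimentscore = positive - negative)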
Question 16
Create a plot showing how the sentiment score changes throughout the book. This means the x axis should be the chapter and the y axis should be the sentiment score.
Hint: If you want the positive and negative scores to be colored differently, one way to do this is to have aes(chapter, sentimentscore, fill = (sentimentscore > 0)) be part of your plot command in the appropriate place. If you don’t like the default colors that this chooses, you can change them by adding a line to the end of your ggplot command: + scale_fill_manual(values = c("COLOR1", "COLOR2"), guide = "none") and filling in two colors of your choice!
Question 17
In which chapters do the words appear to be predominantly negative?
Considering other lexicons
Our analysis so far focuses on using one lexicon, which is the Bing lexicon. Would our conclusions change if we used a different lexicon?
Question 18
Create an AFINN score for each chapter in Pride & Prejudice. Create a plot showing how the sentiment score changes across each chapter of the book.
Question 19
Discuss whether or not any conclusions about the positivity or negativity of the book chapters change if we switch from the Bing lexicon to the AFINN lexicon. In other words, does the trend of sentiment seem to be different depending on which of the sentiment lexicons you choose, or is the pattern of sentiment about the same? Explain.
Other Sentiments
So far, we have focused on positive or negative sentiments. The Bing and AFINN lexicons are only able to measure these two sentiments. However, there are a lot of other possible sentiments we can explore!
To start exploring other sentiments, we will use the nrc lexicon.
Question 20
How many different sentiments are measured in the nrc lexicon?
Pride & Prejudice is all about the main characters learning to trust one another. Let’s see if we can track how words related to trust change throughout the book!
Question 21
Before we do that, can you think of any other applications where it might be helpful to track how trust changes over time?
To focus on only the trust words in the nrc lexicon, we can use:
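A minimal sketch of what this chunk might look like, assuming the trust subset is saved as nrc_trust, the name used below:

# Keep only the words tagged with the "trust" sentiment in the nrc lexicon
nrc_trust <- get_sentiments("nrc") |>
  filter(sentiment == "trust")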
We can then treat nrc_trust like we have any other lexicon.
Question 22
Create a plot showing how the number of trust words changes as the chapters in Pride and Prejudice progress.
References
Lexicons
NRC: Saif M. Mohammad and Peter Turney (2013). “Crowdsourcing a Word-Emotion Association Lexicon.” Computational Intelligence, 29(3): 436-465.
AFINN: Finn Årup Nielsen (2011). “A new ANEW: Evaluation of a word list for sentiment analysis in microblogs.” Proceedings of the ESWC2011 Workshop on ‘Making Sense of Microposts’: Big things come in small packages, CEUR Workshop Proceedings 718, 93-98. http://arxiv.org/abs/1103.2903.
bing: Minqing Hu and Bing Liu (2004). “Mining and summarizing customer reviews.” Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-2004).
Data
Silge J (2022). janeaustenr: Jane Austen’s Complete Novels. R package version 1.0.0, https://CRAN.R-project.org/package=janeaustenr.
Code
This analysis and code were adapted from “Text Mining with R: A Tidy Approach”, written by Julia Silge and David Robinson. The book was last built on 2024-06-20.