STA 279 Lab 4

Complete all Questions.

The Goal

We have been learning about the foundations of sentiment analysis. The goal for today is to apply what we have learned and to dig a little deeper into what we can do with sentiment analysis.

The Data Set

As sentiment analysis is designed to measure emotion, one common application is analyzing books, movies, or other media text. Today, we are going to work with a collection of books written by Jane Austen.

To load the data, use the following:

# Load the libraries
library(janeaustenr)
library(dplyr)
library(stringr)
library(ggplot2)
library(tidyr)
library(tidytext)
library(forcats)

books <- austen_books()

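books <- books |>
  group_by(book) |>
  mutate(
    # Line number within each book
    linenumber = row_number(),
    # Running count of chapter headings within each book
    chapter = cumsum(str_detect(text, 
                                regex("^chapter [\\divxlc]", 
                                      ignore_case = TRUE)))) |>
  # Drop the grouping so later steps work on the full data set
  ungroup()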

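The data set returned by austen_books() has \(n=73422\) rows and 2 columns. The first column, text, holds one line of text from a book, and the second column, book, tells us which of Jane Austen’s books the line came from. The code above adds two more columns to books: linenumber, the position of the line within its book, and chapter, the chapter the line belongs to.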
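
To see how the chapter column is built, here is a minimal sketch on a few made-up lines (toy is a hypothetical data set, not part of the lab data). str_detect flags each line that looks like a chapter heading, and cumsum turns those flags into a running chapter number.

```r
library(dplyr)
library(stringr)

# Toy text, made up for illustration: two chapter headings plus a few lines
toy <- tibble(
  text = c("CHAPTER 1", "It is a truth universally acknowledged,", "",
           "Chapter II", "Mr. Bennet was among the earliest")
)

toy <- toy |>
  mutate(
    linenumber = row_number(),
    # TRUE on heading lines, FALSE elsewhere; cumsum() counts the TRUEs so far
    chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                            ignore_case = TRUE)))
  )

toy
```

Every line between one heading and the next gets the same chapter number, including blank lines.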

Question 1

How many unique Jane Austen books are included in this data set?

Question 2

What text is present on Line 9 of this data set?

This is the first time we’ve worked with text that is longer than a few lines! The words in the books were input line by line for each book. Looking at the data, we can see that this means blank lines were included, as well as things like chapter numbers.
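
As a reminder of what tokenizing does to line-by-line text, here is a small sketch on two made-up lines (toy_lines and toy_tokens are hypothetical names). It assumes the default settings of unnest_tokens, which split on words, convert to lowercase, and strip punctuation.

```r
library(dplyr)
library(tidytext)

# Two made-up lines of text, stored one line per row like the books data
toy_lines <- tibble(
  book = "Toy Book",
  text = c("It is a truth,", "universally ACKNOWLEDGED!")
)

# One row per word; the other columns (here, book) are carried along
toy_tokens <- toy_lines |>
  unnest_tokens(word, text)

toy_tokens
```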

Question 3

Create tidy_books by tokenizing books. Do not remove stop words or punctuation. How many rows are in the resultant data set tidy_books?

Pride & Prejudice: Bing

Let’s start our work today by focusing on only one book: Pride & Prejudice.

Question 4

Create a subset of tidy_books called tidy_PandP which contains only the words from Pride & Prejudice. How many rows (and therefore how many words in total) are in tidy_PandP?

When we perform sentiment analysis, we generally look at the words in the text and determine what emotion each word conveys. Is it a positive word? A negative word? A word about surprise?

To determine what emotion a word expresses, we use lexicons. There are many lexicons to choose from, but for today, let’s start by exploring the Bing lexicon.

The Bing lexicon has a list of words and has tagged each of them as either “positive” or “negative”. We can see this if we run the code below:

# Load the lexicon
binglexicon <- get_sentiments("bing")
head(binglexicon)

If you look at binglexicon, you will notice it has two columns. The first column, word, holds the words in the Bing lexicon. The second column, sentiment, states which sentiment is associated with each word. There are only two options for sentiment in this lexicon: positive or negative.

If we only want to find the words that are positive, we therefore use

bing_positive <- get_sentiments("bing") |>
     filter(sentiment == "positive")

Remember, filter allows us to choose rows in a data set that have a specific property! In this case, we choose only rows with positive in the sentiment column.
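
If it helps to see filter on something small, here is a sketch using a made-up four-word lexicon (toy_lexicon is hypothetical and only mimics the two-column layout of the Bing lexicon).

```r
library(dplyr)

# A made-up lexicon with the same two columns as the Bing lexicon
toy_lexicon <- tibble(
  word = c("happy", "sad", "delight", "gloomy"),
  sentiment = c("positive", "negative", "positive", "negative")
)

# Keep only the rows whose sentiment is "positive"
toy_positive <- toy_lexicon |>
  filter(sentiment == "positive")

toy_positive
```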

Question 5

How many negative words are in the Bing lexicon?

Okay, great. So far, we have a list of positive words and a list of negative words. What can we do with this?

One thing we can do is count how many words from the negative list and how many words from the positive list appear in our text of interest, which is Pride & Prejudice. Let’s start by looking at the positive words in tidy_PandP. We want to count how many positive words from the Bing lexicon occur in the book. This means that a first step is to ignore any words in Pride & Prejudice that are not in our positive Bing lexicon. We can do this with the following code:

# Start with tokenized data
tidy_PandPpositive <- tidy_PandP |>
  # Keep only the positive words
  inner_join(bing_positive)
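
To see what inner_join keeps, here is a minimal sketch with made-up words (toy_words and toy_positive are hypothetical stand-ins for tidy_PandP and bing_positive). Words that are not in the lexicon are dropped, and a word that appears several times in the text is kept once for each appearance.

```r
library(dplyr)

# Made-up tokenized text: one word per row
toy_words <- tibble(word = c("a", "happy", "day", "and", "a", "happy", "delight"))

# Made-up positive lexicon
toy_positive <- tibble(word = c("happy", "delight"), sentiment = "positive")

# Keep only the rows of toy_words whose word also appears in toy_positive
toy_matches <- toy_words |>
  inner_join(toy_positive, by = "word")

toy_matches
```

Writing by = "word" names the matching column explicitly, which also avoids the message that R prints when it has to guess the column.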

Question 6

What percent of all the words in Pride & Prejudice are positive words, according to the Bing lexicon?

You will notice that when you run commands like inner_join or other merge commands, you get a message output on your screen:

Joining with `by = join_by(word)`

If you want to hide this output, you can change your chunk header to {r, message = FALSE, warning = FALSE}. If you’d like to make this change for all chunks, let Dr. Dalzell know! She can show you how to do this all at once so you don’t have to do it for each chunk individually. As a hint, this will be very helpful for Data Analysis 1!

Question 7

What percent of all the words in Pride & Prejudice are negative words, according to the Bing lexicon?

Once we have determined which words in the book are positive, we can count how many times each positive word appears. This allows us to find the most frequently occurring positive words in the book.

Question 8

Consider these two count commands:

# Count Command 1 
count_PandPpositive <- tidy_PandP |>
  inner_join(bing_positive) |>
  count()
# Count Command 2
count_PandPpositive <- tidy_PandP |>
  inner_join(bing_positive) |>
  count(word)

The only difference is whether or not we have word inside the parentheses.

  1. Our goal is to count how many times each positive word appears in the book. Which of these two options (Command 1 or Command 2) do we want?

  2. Run the line of code you chose in part 1 and state which positive word occurs most often in the book. Hint: Remember that if you want the most frequently occurring word to appear at the top of your result, you need to add sort = TRUE to your count command. This means either count(sort = TRUE) or count(word, sort = TRUE), depending on which option you chose.

Question 9

Using the Bing lexicon, make a formatted table of the top 15 positive words in Pride & Prejudice.

Question 10

Using the Bing lexicon, make a formatted table of the top 15 negative words in Pride & Prejudice.

Considering “miss”

Looking at the negative words, we notice something: the word “miss” is listed. While to “miss” an event or to “miss” someone is indeed negative, in a Jane Austen novel “miss” is used repeatedly in front of the names of characters: Miss Bennet, Miss Elizabeth, etc. This means that “miss” is essentially a stop word for Jane Austen novels!

This sort of thing actually happens all the time in text analysis. Words are not static; their uses and meanings change over time and with different applications. This means we often have to think critically about how to adapt techniques based on the context of the text we are working with.

There are a few options for dealing with words that should be stop words in a certain analysis. The first is to add the word to the list of stop words and then use anti_join as usual to remove both the standard stop words and the custom stop word:

custom_stop_words <- bind_rows(tibble(word = c("miss"),
                    lexicon = c("custom")), stop_words)
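
As a sketch of how this first option plays out, the code below rebuilds custom_stop_words and applies it to four made-up words (toy_words and toy_kept are hypothetical names). It assumes the stop_words data set that ships with tidytext.

```r
library(dplyr)
library(tidytext)  # provides the stop_words data set

# Standard stop words plus our custom stop word "miss"
custom_stop_words <- bind_rows(
  tibble(word = "miss", lexicon = "custom"),
  stop_words
)

# Made-up tokenized text
toy_words <- tibble(word = c("miss", "bennet", "was", "delighted"))

# anti_join keeps only the words that are NOT in the stop word list
toy_kept <- toy_words |>
  anti_join(custom_stop_words, by = "word")

toy_kept
```

Note that this removes the standard stop words too, so it changes the total word count more than removing “miss” alone would.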

The second method is to change the sentiment of a specific word, for instance by labeling “miss” as neutral in a lexicon.

# Store the lexicon
bing_custom <- get_sentiments("bing") 

# Assign "miss" a neutral sentiment
bing_custom <- bing_custom |>
  mutate(sentiment = ifelse(word=="miss","neutral", sentiment))

# Print it out
bing_custom |>
  filter(word=="miss")

This means that when we filter to positive or negative sentiments only, this word will be excluded.
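
Here is a small sketch of that effect on a made-up three-word lexicon (toy_lexicon is hypothetical and stands in for the Bing lexicon). Once “miss” is relabeled, it no longer survives a filter for negative words.

```r
library(dplyr)

# Made-up lexicon in which "miss" starts out as negative
toy_lexicon <- tibble(
  word = c("miss", "gloomy", "happy"),
  sentiment = c("negative", "negative", "positive")
)

# Relabel "miss" as neutral and leave every other word unchanged
toy_custom <- toy_lexicon |>
  mutate(sentiment = if_else(word == "miss", "neutral", sentiment))

# Filtering to negative words now skips "miss"
toy_negative <- toy_custom |>
  filter(sentiment == "negative")

toy_negative
```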

The final way, and the one we will use for now, is just to remove the word from tidy_PandP.

tidy_PandP <- tidy_PandP |>
  anti_join(tibble(word = c("miss")))

Question 11

Using the Bing lexicon after removing “miss”, make a formatted table of the top 15 negative words in Pride & Prejudice.

Comparing

Now that we have looked at the negative counts and positive counts separately, let’s compare them!

Question 12

  1. With the word “miss” removed, make a bar chart to show the top 15 negative words and top 15 positive words in Pride and Prejudice.

  2. Overall, are there more positive words or negative words in Pride and Prejudice?

Okay, great! We can determine common words associated with different sentiments and how often these words occur in a text. What else can we do with sentiment analysis?

Changes in Sentiment Across the Book

One cool thing we can do with sentiment analysis is determine how sentiment changes throughout the book. Books have sad scenes, happy scenes, etc. We can track this using sentiment analysis.

Let’s start by counting the number of positive and negative words in each chapter of the book. Before we do this, take a look at tidy_PandP. Does it contain the variable chapter? If so, you can skip the next step. If not, this means that we need to tokenize our data again:

tidy_PandP <- books |>
  # Choose only P and P 
  filter(book == "Pride & Prejudice") |>
  # Group by chapter
  group_by(chapter) |>
  # Break into words
  unnest_tokens(word , text) |>
  # Remove the stop word "miss"
  anti_join(tibble(word = c("miss")))

With chapter now available to use, we can count the positive and negative words in each chapter.

sentiment_bychapter <- tidy_PandP |>
  inner_join(get_sentiments("bing")) |>
  group_by(chapter)|>
  count(sentiment)

Here, you will notice that we have grouped by chapter, because we want to count the number of positive and negative words in each chapter.
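
To see what grouping adds, here is a sketch on five made-up tagged words (toy_tagged is a hypothetical stand-in for the joined data). It also shows that listing chapter inside count gives the same counts as grouping first.

```r
library(dplyr)

# Made-up words that have already been matched to a sentiment
toy_tagged <- tibble(
  chapter = c(1, 1, 1, 2, 2),
  word = c("happy", "sad", "delight", "gloomy", "sad"),
  sentiment = c("positive", "negative", "positive", "negative", "negative")
)

# Option A: group by chapter, then count sentiments within each chapter
by_group <- toy_tagged |>
  group_by(chapter) |>
  count(sentiment) |>
  ungroup()

# Option B: name both columns inside count()
by_count <- toy_tagged |>
  count(chapter, sentiment)

by_count
```

Both versions give one row per chapter and sentiment, with the count in a column called n.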

Question 13

How many positive words are in Chapter 6? How many negative words are in Chapter 6?

At this point, it would be cool if we could make a plot to see how the number of positive words and negative words differs by chapter. However, this would require sentiment_bychapter to have each row be a Chapter, with one column for the number of positive words and another for the number of negative words. Our data doesn’t look like that right now…but it can.

To get the data in the format that we want, we can use the pivot_wider command in R.

The names of the new columns we want to create are stored in the sentiment column of sentiment_bychapter, so we want the two values in this column to form two new columns: positive and negative. We want to fill these columns with the corresponding value of n (the number of words), and to put a 0 wherever a chapter has no words of a given sentiment. To achieve this, we use:

sentiment_bychapter <- tidy_PandP |>
  # Keep only the words that appear in the Bing lexicon
  inner_join(get_sentiments("bing")) |>
  # Count how many positive and negative
  # words are in each chapter
  count(chapter, sentiment) |>
  # Spread the counts into one column per sentiment
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0)
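
To see the reshaping step on its own, here is a sketch on three made-up rows of counts (toy_counts is hypothetical). Chapter 2 has no positive words, so values_fill = 0 is what supplies its 0.

```r
library(dplyr)
library(tidyr)

# Made-up counts in long format: one row per chapter and sentiment
toy_counts <- tibble(
  chapter = c(1, 1, 2),
  sentiment = c("negative", "positive", "negative"),
  n = c(1L, 2L, 2L)
)

# One row per chapter, one column per sentiment
toy_wide <- toy_counts |>
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0)

toy_wide
```

Without values_fill = 0, the missing combination would show up as NA instead.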

We can now create a plot to see how the sentiment changes throughout the book! Let’s see how the number of positive words changes throughout the chapters.

Question 14

Create a professionally formatted bar plot showing the number of positive words in each chapter. The x axis should be the chapter number and the y axis should be the number of positive words.

Question 15

Create a professionally formatted bar plot showing the number of negative words in each chapter. The x axis should be the chapter number and the y axis should be the number of negative words.

Another way of comparing how the numbers of positive and negative words change across the book is to create a sentiment score. For instance, we could subtract the number of negative words from the number of positive words.

sentiment_bychapter  <- sentiment_bychapter  |>
  mutate(sentimentscore = positive - negative)

Question 16

Create a plot showing how the sentiment score changes throughout the book. This means the x axis should be the chapter and the y axis should be the sentiment score.

Hint: If you want the positive and negative scores to be colored differently, one way to do this is to make aes(chapter, sentimentscore, fill = (sentimentscore > 0)) part of your plot command in the appropriate place. If you don’t like the default colors that this chooses, you can change them by adding a line to the end of your ggplot command: + scale_fill_manual(values = c("COLOR1", "COLOR2"), guide = "none") and filling in two colors of your choice! (Older code uses guide = FALSE, which is deprecated in current versions of ggplot2.)
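
As a rough sketch of how the pieces of the hint fit together, here is a plot built from four made-up chapters (toy_scores and its counts are hypothetical, and the two colors are just one possible choice). Your own plot will still need a title and labels that suit your data.

```r
library(dplyr)
library(ggplot2)

# Made-up chapter counts, scored the same way as in the lab
toy_scores <- tibble(
  chapter = 1:4,
  positive = c(12, 5, 9, 3),
  negative = c(7, 8, 4, 10)
) |>
  mutate(sentimentscore = positive - negative)

# Bars are filled by whether the score is above 0
p <- ggplot(toy_scores, aes(chapter, sentimentscore, fill = sentimentscore > 0)) +
  geom_col() +
  # First color is used for FALSE (score not above 0), second for TRUE
  scale_fill_manual(values = c("firebrick", "steelblue"), guide = "none") +
  labs(x = "Chapter", y = "Sentiment score (positive - negative)")

p
```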

Question 17

In which chapters do the words appear to be predominately negative?

Considering other lexicons

Our analysis so far has used only one lexicon, the Bing lexicon. Would our conclusions change if we used a different lexicon?

Question 18

Create an AFINN score for each chapter in Pride & Prejudice. Create a plot showing how the sentiment score changes across the chapters of the book.
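
Unlike Bing, AFINN gives each word a number rather than a label, so one possible way to build a chapter score is to add up those numbers. The sketch below does this on a made-up lexicon (toy_afinn and its values are hypothetical); it assumes the layout of get_sentiments("afinn"), which has the columns word and value.

```r
library(dplyr)

# Made-up scores shaped like the AFINN lexicon (columns word and value)
toy_afinn <- tibble(
  word = c("happy", "sad", "delight", "gloomy"),
  value = c(3, -2, 3, -2)
)

# Made-up tokenized text with a chapter column
toy_words <- tibble(
  chapter = c(1, 1, 1, 2, 2),
  word = c("happy", "sad", "walk", "gloomy", "sad")
)

# Match words to scores, then add up the scores within each chapter
toy_scores <- toy_words |>
  inner_join(toy_afinn, by = "word") |>
  group_by(chapter) |>
  summarise(afinn_score = sum(value))

toy_scores
```

Summing is only one choice; averaging the values within a chapter would be another reasonable score.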

Question 19

Discuss whether or not any conclusions about the positivity or negativity of the book chapters change if we switch from the Bing lexicon to the AFINN lexicon. In other words, does the trend of sentiment seem to be different depending on which of the sentiment lexicons you choose, or is the pattern of sentiment about the same? Explain.

Other Sentiments

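So far, we have focused on positive or negative sentiments. The Bing lexicon labels each word as positive or negative, and the AFINN lexicon scores each word from -5 (most negative) to 5 (most positive), so both can only measure positivity and negativity. However, there are a lot of other possible sentiments we can explore!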

To start exploring other sentiments, we will use the nrc lexicon.

nrc_hold <- get_sentiments("nrc")

Question 20

How many different sentiments are measured in the nrc lexicon?

Pride & Prejudice is all about the main characters learning to trust one another. Let’s see if we can see how words related to trust change throughout the book!

Question 21

Before we do that, can you think of any other applications where it might be helpful to track how trust changes over time?

To focus on only the trust words in the nrc lexicon, we can use:

nrc_trust <- get_sentiments("nrc") |>
  filter(sentiment == "trust")

We can then treat nrc_trust like we have any other lexicon.

Question 22

Create a plot showing how the number of trust words changes as the chapters in Pride and Prejudice progress.

References

Lexicons

NRC: Saif M. Mohammad and Peter Turney (2013). “Crowdsourcing a Word-Emotion Association Lexicon.” Computational Intelligence, 29(3): 436-465.

AFINN: Finn Årup Nielsen (2011). “A new ANEW: Evaluation of a word list for sentiment analysis in microblogs.” Proceedings of the ESWC2011 Workshop on ‘Making Sense of Microposts’: Big things come in small packages, Volume 718 in CEUR Workshop Proceedings, 93-98. May 2011. http://arxiv.org/abs/1103.2903.

bing: Minqing Hu and Bing Liu (2004). “Mining and summarizing customer reviews.” Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-2004).

Data

Silge J (2022). janeaustenr: Jane Austen’s Complete Novels. R package version 1.0.0, https://CRAN.R-project.org/package=janeaustenr.

Code

This analysis and code were adapted from “Text Mining with R: A Tidy Approach”, written by Julia Silge and David Robinson. The book was last built on 2024-06-20.