STA 279 Lab 6

Complete all Questions.

The Goal

So far in this course, we have seen a lot of things we can do with individual words. However…words connect to other words, and sometimes when that happens their meanings change. Today we are going to start talking about n-grams, which means \(n\) words in a phrase. This means that instead of working with single words all the time, we will have the ability to analyze multiple word phrases.

The Data Set

As sentiment analysis is designed to measure emotion, one common application is analyzing books, movies, or other media text. Today, we are going to work with a collection of books written by Jane Austen.

To load the data, use the following:

# Load the libraries
library(janeaustenr)
library(dplyr)
library(stringr)
library(ggplot2)
library(tidyr)
library(tidytext)
library(forcats)

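# Load the novels: one row per line of text
books <- austen_books()

# Within each book, number the lines and track the current chapter
books <- books |>
  group_by(book) |>
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, 
                                regex("^chapter [\\divxlc]", 
                                      ignore_case = TRUE))))

# Drop the blank lines
books <- books |>
  filter(text != "")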

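The data set books has \(n=62279\) rows and 4 columns. The first column text is part of the text of the book, and the second column book tells us which of Jane Austen’s books the text came from. The code above also added two columns: linenumber (the position of the line within its book) and chapter (the chapter the line belongs to).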
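To see how the chapter column gets built, here is a small sketch on made-up lines (not the real data). It uses base R's grepl with perl = TRUE in place of str_detect so that it runs without extra packages; that substitution is an assumption for illustration, and the pattern is the same one used above.

```r
# Made-up lines for illustration only
toy_lines <- c("SENSE AND SENSIBILITY", "CHAPTER 1", "The family of Dashwood",
               "Chapter II", "a chapter of accidents")

# TRUE when a line starts with "chapter" followed by a digit or a
# Roman-numeral letter, ignoring case
is_heading <- grepl("^chapter [\\divxlc]", toy_lines,
                    ignore.case = TRUE, perl = TRUE)

# Each heading bumps the running chapter number by one
chapter <- cumsum(is_heading)

data.frame(text = toy_lines, is_heading, chapter)
```

Lines before the first heading get chapter 0, and a line that merely contains the word "chapter" does not start a new chapter.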

Question 1

How many unique Jane Austen books are included in this data set?

Question 2

What text is present on Line 4 of this data set?

This is the first time we’ve worked with text that is longer than a few lines! The words in the books were input line by line for each book. Looking at the data, we can see that this means blank lines were included, as well as things like chapter numbers.

Question 3

Create tidy_books by tokenizing books into words. Do not remove stop words or punctuation. How many rows are in the resultant data set tidy_books?

Typically, our first step is to tokenize our text into words. However, we can actually tokenize text by dividing it into phrases instead of single words. We are going to start by tokenizing our text into bigrams. Bigrams are two-word phrases, like “good day” or “hard work”.

Tokenizing our text into bigrams uses code very similar to what we are used to, but there are some differences.

# Tokenize the text into bigrams
tidy_Austen_bigrams <- austen_books() |>
  unnest_tokens(bigram, text, token = "ngrams", n = 2) |> 
  filter(!is.na(bigram))

The first difference we notice is in the second line of code, where we tokenize the text. The function unnest_tokens is the same, but instead of only two inputs like we have when we tokenize into words, tokenizing into bigrams requires four inputs.

    1. bigram: This tells R to create a column called bigram to store the tokenized text.
    2. text: This tells R which column we are tokenizing.
    3. token = "ngrams": This tells R we will tokenize by breaking the text into phrases rather than words.
    4. n = 2: This tells R specifically to tokenize into 2 word phrases, i.e., bigrams.
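As a quick check on how the two-input and four-input versions differ, here is a minimal sketch on a single made-up row of text. The object names toy, toy_words, and toy_bigrams are ours, for illustration only.

```r
library(dplyr)
library(tidytext)

# One made-up row of text (not from the Austen data)
toy <- tibble(text = "it is a truth universally acknowledged")

# Two inputs: one word per row
toy_words <- toy |>
  unnest_tokens(word, text)

# Four inputs: one two-word phrase per row
toy_bigrams <- toy |>
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

toy_words
toy_bigrams
```

Six words give five bigrams, because each bigram starts one word later than the previous one.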

There is also a brand new line of code: filter(!is.na(bigram)). To figure out what this line of code does, let’s see what happens if we leave it out.

Question 4

Run the above code for tokenizing the text into bigrams, but remove the filter(!is.na(bigram)) line of code.

Look at tidy_Austen_bigrams (but do NOT print it in your Markdown file). Once you have done this, put the filter(!is.na(bigram)) line of code back and run the code again.

As your answer to this question, tell me what that filter(!is.na(bigram)) line of code does!

Question 5

Write code to tokenize the text into trigrams, which means 3 word phrases. As the answer to this question, tell me how many trigrams are in the Jane Austen data set.

What we have just seen is that the code we are working with today can be extended to more than just bigrams! We can look at 3 word phrases, 4 word phrases, etc. Typically in practice, words, bigrams, and trigrams tend to be the most useful, but we have options!

For now, let’s focus on bigrams. We have explored the code for how to tokenize the text into bigrams, but what does tokenizing the text of the books into bigrams look like? To find out, let’s look at the first few rows in tidy_Austen_bigrams.

knitr::kable(head(tidy_Austen_bigrams))

book                  bigram
Sense & Sensibility   sense and
Sense & Sensibility   and sensibility
Sense & Sensibility   by jane
Sense & Sensibility   jane austen
Sense & Sensibility   chapter 1
Sense & Sensibility   the family

Let’s look at the first two bigrams: “sense and” and “and sensibility”. Together, these are the bigram version of the phrase “sense and sensibility”, which is the title of the book.

This means that when we separate into bigrams, we include all possible two-word phrases, so consecutive bigrams overlap. We don’t keep “sense and” together and then start the second bigram with “sensibility”. The overlap does not carry across separated pieces of text, though. There is no bigram “sensibility by” because a blank line separates the title “sense and sensibility” from the author line “by Jane Austen”, and likewise no bigram joins the author line to the first chapter title.
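The overlapping idea can be sketched in a few lines of base R. The helper make_bigrams is a hypothetical function written just for this illustration; it is not how tidytext does the work internally.

```r
# Hypothetical helper: pair every word with the word that follows it
make_bigrams <- function(words) {
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

title_bigrams <- make_bigrams(c("sense", "and", "sensibility"))
title_bigrams
# "sense and"  "and sensibility"
```

The word “and” shows up in both bigrams, once as the second word and once as the first.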

It turns out that we can do very similar things with n-grams that we can with individual words. Let’s start with counting, TF-IDF, and stop words.

Counts and Stop Words

One of the things we can do with n-grams is count the number of times each n-gram appears in a text. Just like when we do this with words, this can help us determine what the text is all about.

If we wanted to count the number of times each word appears, we would use code like the following, where tidy_Austen_words stands for a data set that has been tokenized into words:

count_words <- tidy_Austen_words  |>
  count(word, sort = TRUE)

We can use very similar code to count the number of times each bigram appears:

count_bigrams <- tidy_Austen_bigrams |>
  count(bigram, sort = TRUE)

Question 6

Print out the 10 bigrams that occur most frequently by using knitr::kable(count_bigrams[1:10, ]). What do you notice about these bigrams?

Ah, stop words!! The answer to Question 6 is our first hint that some of the issues that we saw when we worked with words can also occur with n-grams. We are going to handle these issues in a very similar way, but the code will look a little different.

We know we can handle stop words by removing them, or by using a tool like the TF-IDF that effectively filters out the stop words by down weighting them. We will try both, but let’s start by thinking about how to remove stop words.

When we tokenized the text into words, we used anti_join to remove all words that were stop words. Unfortunately, when we work with n-grams, we can’t just remove the stop words before we tokenize. This is because removing stop words first can create bigrams that were not in the original text.

For example, consider the phrase “were not in the original text”. If we removed the stop words “were”, “in” and “the”, we would get “not original text”. This means that when we tokenized into bigrams, we would get “not original” and “original text”. The second bigram is a bigram we had in the original phrase, but the first is not.
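We can check this example with a short base R sketch. The helper make_bigrams is a hypothetical function written for this illustration, and the list of removed words is just the three words named in the example, not the full stop word list.

```r
# Hypothetical helper: pair every word with the word that follows it
make_bigrams <- function(words) {
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

phrase  <- c("were", "not", "in", "the", "original", "text")
removed <- c("were", "in", "the")

# Bigrams of the phrase as written
original_bigrams <- make_bigrams(phrase)

# Bigrams if we drop the three words BEFORE tokenizing
after_removal <- make_bigrams(phrase[!phrase %in% removed])

# Bigrams that exist only because words were removed first
invented <- setdiff(after_removal, original_bigrams)
invented
# "not original"
```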

Instead of removing all the stop words before we tokenize, we are going to tokenize the text into bigrams first. We are then going to remove all of the bigrams that contain at least one stop word. This is going to require a few steps.

Step 1: We break up our bigrams into two words. The command separate is going to help us do this. It requires three inputs:

    1. what we want to break apart (the column bigram)
    2. what new columns we want to create (word1 and word2)
    3. how the computer should tell when to break apart the bigrams. Our bigrams have a space in between each word, so a space (" ") is where one word stops and another starts!

This means we will use this code:

bigrams_separated <- tidy_Austen_bigrams |>
  separate(bigram, c("word1", "word2"), sep = " ")

Question 7

What is the content of row 8 in bigrams_separated?

Question 8

Show the code you would need to break up the trigrams from Question 5 into three words.

Step 2: Once we have broken the bigrams into two words, we can remove any rows where the first word or the second word is a stop word. We know that we can use filter to choose any rows that have specific qualities we want to keep. It turns out we can also use filter to remove rows that we don’t want! Here, the ! in the code means “not”, so it tells R to keep only the rows where the word is not in the list of stop words.

bigrams_clean <- bigrams_separated |>
  # Remove any rows where word 1 is a stop word
  filter(!word1 %in% stop_words$word) |>
  # Remove any rows where word 2 is a stop word
  filter(!word2 %in% stop_words$word)
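To see what the ! is doing, here is a tiny base R sketch. The vector toy_stops is a made-up three-word list for illustration, not the real stop_words data.

```r
# Made-up stop word list and first words, for illustration only
toy_stops <- c("the", "of", "and")
word1 <- c("the", "dear", "of", "miss")

# TRUE where the word IS a stop word
is_stop <- word1 %in% toy_stops

# TRUE where the word is NOT a stop word; R evaluates %in% first, then !
not_stop <- !word1 %in% toy_stops

# Keeping only the TRUE positions drops the stop words
kept <- word1[not_stop]
kept
# "dear" "miss"
```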

Question 9

Show the code you would need to remove the stop words from the trigrams from Question 8.

At this point, we have removed all the bigrams that include stop words. However, our bigrams are currently broken up into two individual words. To actually use the bigrams for analysis, we have to put the words back together again! We use unite for this.

tidy_Austen_bigrams_noStop <- bigrams_clean |>
  unite(bigram, word1, word2, sep = " ")

The inputs are:

    1. bigram: what new column do we want to create
    2. word1 and (3) word2: what columns do we want to combine
    4. sep = " ": include a space between our two words in the bigram when we put them together
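Putting the separate, filter, and unite steps together, here is a minimal sketch on four made-up bigrams. The names toy_bigrams, toy_stops, and toy_clean are ours, and toy_stops is a made-up two-word list rather than the real stop_words data.

```r
library(dplyr)
library(tidyr)

# Made-up inputs for illustration only
toy_stops   <- c("the", "of")
toy_bigrams <- tibble(bigram = c("the house", "dear sir",
                                 "of course", "miss bennet"))

toy_clean <- toy_bigrams |>
  # Split each bigram at the space
  separate(bigram, c("word1", "word2"), sep = " ") |>
  # Drop rows where either word is in the toy stop list
  filter(!word1 %in% toy_stops) |>
  filter(!word2 %in% toy_stops) |>
  # Glue the two words back together
  unite(bigram, word1, word2, sep = " ")

toy_clean
```

Only “dear sir” and “miss bennet” survive, because the other two bigrams each contain a word from the toy stop list.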

Question 10

Show the code you would need to re-unite the trigrams you split in Question 9.

So, we can remove stop words as we did with words, though it does take a little more work. At this point, the next step is to determine how often each of these bigrams occurs in the text.

Question 11

Create an object called bigrams_count that counts the number of times each bigram occurs in the data set, now that we have removed stop words. Print out the 10 bigrams that occur most frequently (after removing stop words).

Question 12

Create a plot (geom_col) to show the ten most frequently occurring trigrams after removing the stop words. Make sure to label your axes.

Gender Analysis

So far, we have seen that we can tokenize the text into n-grams (bigrams and trigrams so far). We can also remove stop words and count the number of n-grams that occur most frequently. These are all things we could also do when we tokenized text into words.

However, there are things we can do with n-grams that we could not do with single words. For instance, what if we were curious what actions in Jane Austen’s books were associated with men vs women? In other words, what did “he” versus “she” do in her books? This analysis was performed by Julia Silge on her blog, and we are going to replicate it!

In Jane Austen’s time, “she” and “he” were the pronouns primarily used to designate women and men. If we want to compare female and male actions, we are looking for things like “she said” or “he did”. This means we need to find all the bigrams whose first word is either “he” or “she”.

The first step is to store the key words we are looking for.

pronouns <- c("he", "she")

We have already seen that we can break up our bigrams into two words, as we just did when we created bigrams_separated before removing the stop words. However, in this case instead of removing all bigrams that contain stop words, we want to keep all bigrams that start with “he” or “she”.

To do this, we can use the following:

she_and_he_bigrams <- bigrams_separated |>
    # Keep only bigrams where the first word
    # is she or he
    filter(word1 %in% pronouns)

Notice that we are working with bigrams_separated, which means we have not removed stop words. This is important here because “she” and “he” are considered stop words.

We can add one more line to count the number of times each bigram appears.

she_and_he_bigrams_count <- bigrams_separated |>
    # Keep only bigrams where the first word
    # is she or he
    filter(word1 %in% pronouns)|>
    # Count the number of times each bigram appears 
    count(word1, word2, sort = TRUE) 
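Here is the same filter-then-count idea on a handful of made-up rows, so we can see exactly what comes out. The object names toy_separated and toy_counts are ours, and the rows are invented for illustration.

```r
library(dplyr)

pronouns <- c("he", "she")

# Made-up separated bigrams (not from the Austen data)
toy_separated <- tibble(
  word1 = c("she", "she", "she", "he", "he", "it", "she"),
  word2 = c("said", "said", "said", "said", "said", "was", "felt")
)

toy_counts <- toy_separated |>
  # Keep only rows whose first word is he or she
  filter(word1 %in% pronouns) |>
  # One row per distinct (word1, word2) pair, most frequent first
  count(word1, word2, sort = TRUE)

toy_counts
```

The “it was” row is dropped by the filter, and the remaining six rows collapse into three distinct bigrams with counts 3, 2, and 1.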

Question 13

Using the code above (so don’t remove stop words), how many unique bigrams start with “he”?

Question 14

Which bigram starting with “she” happens most often?

Question 15

Which bigram starting with “he” happens most often?

At this point, we would really like to see which words are associated more often with “she” than “he” in bigrams in Jane Austen, and vice versa. This allows us to describe things that tend to be associated with women vs. men in her books.

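To do this, we are going to look only at the word2 column (the 2nd word in the bigram). We are then going to find the log-ratio of she vs. he for each word. This means we are going to compute (using log base 2, to match the R code later in this lab):

\[\log_2\left( \frac{\text{Number of times this word comes after she}}{\text{Number of times this word comes after he}}\right) \]

We use the log because the raw fraction is lopsided: it is squeezed between 0 and 1 whenever the denominator is larger, but it can grow without limit when the numerator is larger. Taking the log puts both directions on the same scale, which makes the values much easier to compare.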
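A quick sketch of what a base 2 log does to a fraction, using made-up ratios rather than values from the data:

```r
# Made-up ratios: three below 1, one equal to 1, three above 1
ratio <- c(1/8, 1/4, 1/2, 1, 2, 4, 8)

log_ratio <- log2(ratio)
log_ratio
# -3 -2 -1  0  1  2  3
```

The ratios below 1 are crowded into a tiny range, while their logs are spaced just as evenly as the logs of the ratios above 1.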

Question 16

Suppose the log ratio is negative. Does this mean that the word is more often associated with “she” or “he”? Explain briefly.

Question 17

Suppose the log ratio is positive. Does this mean that the word is more often associated with “she” or “he”? Explain briefly.

Like when we work with the TF-IDF, we will not remove stop words when we work with log-ratios. The idea is that stop words are likely to happen with both she and he, so they should have values of the log-ratio that are close to 0.

To find the log-ratio in R, we can use the following.

word_ratios <- she_and_he_bigrams_count |>
    # Keep second words that occur more than 10 times after he and she combined
    group_by(word2) |>
    filter(sum(n) > 10) |>
    ungroup() |>
    # One row per second word, with a count column for each pronoun
    pivot_wider(names_from = word1, values_from = n, values_fill = 0) |>
    # Add 1 to every count, then turn each column into proportions
    mutate(across(where(is.numeric), ~ (.x + 1) / sum(.x + 1))) |>
    # Compute the log ratio
    mutate(logratio = log2(she / he)) |>
    # Sort 
    arrange(desc(logratio))  

Note: This code is very specific to this analysis. I’m happy to talk through what this code means, but it’s also fine not to worry about it.
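For anyone curious about the “add 1, then divide by the sum” step in the code above, here is a small base R sketch on made-up counts. The numbers and the object name toy are invented for illustration; the point is only to show what happens when a word never follows one of the pronouns.

```r
# Made-up counts of three second words after each pronoun
toy <- data.frame(
  word2 = c("said", "remembered", "replied"),
  she   = c(30, 12, 0),
  he    = c(30, 0, 12)
)

# Raw ratio: a zero count makes the result infinite
raw_logratio <- log2(toy$she / toy$he)

# Add 1 to every count, then convert each column to proportions
she_share <- (toy$she + 1) / sum(toy$she + 1)
he_share  <- (toy$he + 1) / sum(toy$he + 1)

# Now every value is a finite number
smoothed_logratio <- log2(she_share / he_share)

data.frame(word2 = toy$word2, raw_logratio, smoothed_logratio)
```

Adding 1 keeps a single zero count from producing an infinite log-ratio, and dividing by the column sum accounts for one pronoun being more common overall than the other.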

Question 18

Based on the log-ratios, what word is most likely to come after “she” rather than “he”?

Question 19

Based on the log-ratios, what word is most likely to come after “he” rather than “she”?

We can also create a graph of the log ratios to visualize this. Let’s look at the top 15 words most likely to be associated with “she”, and the top 15 words most likely to be associated with “he”.

Note: Yep, this is a lot of code. It includes some fun formatting to make the graph look cool, but you are not responsible for knowing how all of this works! As long as you can change the labels, you are set.

word_ratios |>
    mutate(abslogratio = abs(logratio)) |>
    group_by(logratio < 0) |>
    slice_max(abslogratio,n = 15) |>
    ungroup() |>
    mutate(word = reorder(word2, logratio)) |>
    ggplot(aes(word, logratio, color = logratio < 0)) +
    geom_segment(aes(x = word, xend = word,
                     y = 0, yend = logratio), 
                 size = 1.1, alpha = 0.6) +
    geom_point(size = 3.5) +
    coord_flip() +
    labs(x = NULL, 
         y = "Relative appearance after 'she' compared to 'he'",
         title = "Words paired with 'he' and 'she' in Jane Austen's novels")+
    scale_color_discrete(name = "", labels = c("More 'she'", "More 'he'")) +
    scale_y_continuous(breaks = seq(-3, 3),
                       labels = c("0.125x", "0.25x", "0.5x", 
                                  "Same", "2x", "4x", "8x"))

Question 20

Based on the plot above, describe what words seem to be associated more with “she” and what words seem to be more associated with “he” in Jane Austen’s novels.

TF-IDF

We have seen that we can use log-ratios to compare how often a second word in a bigram occurs for a certain first word, like “she” or “he”. If we want to see which bigrams occur most often in each book, we can also use the TF-IDF.

For instance, to find the TF-IDF of each bigram in each book, we use:

bigram_tfidf <- tidy_Austen_bigrams|>
  # Count the number of times each bigram appears in each book
  count(book, bigram) |>
  # Compute the TF-IDF for each bigram in each book
  bind_tf_idf(bigram, book, n) |>
  # Arrange them in descending order
  arrange(desc(tf_idf))

We can then make a plot to display the top 10 bigrams per book in terms of TF-IDF.

bigram_tfidf_top10 <- bigram_tfidf |>
  # Group by book
  group_by(book) |>
  # Choose the top 10 bigrams within each book
  # by TF-IDF
  slice_max(tf_idf, n = 10) |>
  ungroup()
  
# Make the plot 
ggplot(bigram_tfidf_top10, aes(tf_idf, fct_reorder(bigram, tf_idf), fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free") +
  labs(x = "tf-idf", y = NULL)

You will notice here that some of the bigrams contain stop words. This means that, unlike TF-IDF for words, TF-IDF for bigrams can still leave stop words in the top few results.
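One possible reason is sketched below in base R, computing TF-IDF by hand on made-up counts for two hypothetical books. We assume the usual definitions: term frequency is a bigram's share of its book, and inverse document frequency is the natural log of (number of books / number of books containing the bigram). All of the counts and names here are invented for illustration.

```r
# Made-up bigram counts for two hypothetical books (not real Austen counts).
# Each (book, bigram) pair appears in exactly one row.
toy_counts <- data.frame(
  book   = c("Book A", "Book A", "Book B", "Book B"),
  bigram = c("of the", "of kellynch", "of the", "mr darcy"),
  n      = c(50, 10, 40, 20),
  stringsAsFactors = FALSE
)

# Term frequency: share of a book's bigrams taken up by this bigram
book_totals <- tapply(toy_counts$n, toy_counts$book, sum)
toy_counts$tf <- toy_counts$n / as.numeric(book_totals[toy_counts$book])

# Inverse document frequency: log(number of books / books containing the bigram)
n_books <- length(unique(toy_counts$book))
books_with_bigram <- table(toy_counts$bigram)
toy_counts$idf <- log(n_books / as.numeric(books_with_bigram[toy_counts$bigram]))

toy_counts$tf_idf <- toy_counts$tf * toy_counts$idf

# Highest TF-IDF first
toy_counts[order(-toy_counts$tf_idf), ]
```

In this sketch “of the” appears in both books, so its TF-IDF is 0 no matter how often it occurs. “of kellynch” contains the stop word “of” but appears in only one book, so it keeps a positive TF-IDF.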

References

Data

Silge J (2022). janeaustenr: Jane Austen’s Complete Novels. R package version 1.0.0, https://CRAN.R-project.org/package=janeaustenr.

Code

This analysis and code were adapted from Chapter 4 of “Text Mining with R: A Tidy Approach”, written by Julia Silge and David Robinson (https://www.tidytextmining.com/ngrams). The book was last built on 2024-06-20.

Gender Analysis

The gender analysis and code were adapted from Julia Silge’s blog post. Citation: Silge, Julia. “Gender Roles with Text Mining and N-grams”. Created April 15, 2017. URL: https://juliasilge.com/blog/gender-pronouns/