STA 279 Lab 6
Complete all Questions.
The Goal
So far in this course, we have seen a lot of things we can do with individual words. However… words connect to other words, and sometimes when that happens their meanings change. Today we are going to start talking about n-grams, which are phrases of \(n\) words. This means that instead of working with single words all the time, we will be able to analyze multi-word phrases.
The Data Set
As sentiment analysis is designed to measure emotion, one common application is analyzing books, movies, or other media text. Today, we are going to work with a collection of books written by Jane Austen.
To load the data, use the following:
# Load the libraries
library(janeaustenr)
library(dplyr)
library(stringr)
library(ggplot2)
library(tidyr)
library(tidytext)
library(forcats)
books <- austen_books()
books <- books |>
  group_by(book) |>
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text,
                                regex("^chapter [\\divxlc]",
                                      ignore_case = TRUE))))
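# Remove the blank lines
books <- books[-which(books$text == ""), ]
The data set books has \(n=62279\) rows and 4 columns. The first
column text is part of the text of the book, and the second
column book tells us which of Jane Austen’s books the text
came from. The code above also adds linenumber (the position of the
line within its book) and chapter (the chapter the line belongs to).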
Question 1
How many unique Jane Austen books are included in this data set?
Question 2
What text is present on Line 4 of this data set?
This is the first time we’ve worked with text that is longer than a few lines! The words in the books were input line by line for each book. Looking at the data, we can see that this means blank lines were included, as well as things like chapter numbers.
Question 3
Create tidy_books by tokenizing books into
words. Do not remove stop words or punctuation. How many rows are in the
resultant data set tidy_books?
Typically, our first step is to tokenize our text into words. However, we can also tokenize text by dividing it into phrases instead of single words. We are going to start by tokenizing our text into bigrams. Bigrams are two-word phrases, like “good day” or “hard work”.
Tokenizing our text into bigrams uses code very similar to what we are used to, but there are some differences.
# Tokenize the text into bigrams
tidy_Austen_bigrams <- austen_books() |>
  unnest_tokens(bigram, text, token = "ngrams", n = 2) |>
  filter(!is.na(bigram))
The first difference we notice is in the second line of code where we
tokenize the text. The function unnest_tokens is the same,
but instead of only two inputs like we have when we tokenize into words,
tokenizing into bigrams requires four inputs.
- bigram: This tells R to create a column called bigram to store the tokenized text.
- text: This tells R which column we are tokenizing.
- token = "ngrams": This tells R we will be tokenizing by breaking the text into phrases rather than words.
- n = 2: This tells R specifically to tokenize into 2-word phrases, i.e., bigrams.
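To see what these four inputs produce, here is a minimal sketch on two made-up rows of text rather than the full data set. The names toy_text and toy_bigrams and the line column are assumptions made for this illustration only.

```r
library(dplyr)
library(tidytext)

# Two short rows of text (a hypothetical toy data set, not the lab data)
toy_text <- tibble(
  line = 1:2,
  text = c("Sense and Sensibility", "by Jane Austen")
)

# Same four inputs as above: new column, input column, token type, n
toy_bigrams <- toy_text |>
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

toy_bigrams$bigram
# "sense and" "and sensibility" "by jane" "jane austen"
```

Each row of text contributes every pair of adjacent words, converted to lower case.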
There is also a brand new line of code:
filter(!is.na(bigram)). To figure out what this line of
code does, let’s see what happens if we leave it out.
Question 4
Run the above code for tokenizing the text into bigrams, but remove
the filter(!is.na(bigram)) line of code.
Look at tidy_Austen_bigrams (but do NOT print it in your
Markdown file). Once you have done this, put the
filter(!is.na(bigram)) line of code back and run the code
again.
As your answer to this question, tell me what that
filter(!is.na(bigram)) line of code does!
Question 5
Write code to tokenize the text into trigrams, which means 3-word phrases. As the answer to this question, tell me how many trigrams are in the Jane Austen data set.
What we have just seen is that the code we are working with today can be extended to more than just bigrams! We can look at 3-word phrases, 4-word phrases, etc. In practice, words, bigrams, and trigrams tend to be the most useful, but we have options!
For now, let’s focus on bigrams. We have explored the code for how to
tokenize the text into bigrams, but what does tokenizing the text of the
books into bigrams look like? To find out, let’s look at the first few
rows in tidy_Austen_bigrams.
| book | bigram |
|---|---|
| Sense & Sensibility | sense and |
| Sense & Sensibility | and sensibility |
| Sense & Sensibility | by jane |
| Sense & Sensibility | jane austen |
| Sense & Sensibility | chapter 1 |
| Sense & Sensibility | the family |
Let’s look at the first two bigrams: “sense and” and “and sensibility”. These are the bigrams from the phrase “sense and sensibility”, which is the title of the book.
This means that when we separate into bigrams, we include all possible two-word phrases. We don’t keep “sense and” together and then start the second bigram with “sensibility”. The exception is at line boundaries: with the default settings in recent versions of tidytext, each row of text is tokenized on its own, so a bigram never joins the last word of one row to the first word of the next. This is why there is no bigram “sensibility by” between the title “sense and sensibility” and the author line “by Jane Austen”, and none linking the author line to the first chapter title.
It turns out that we can do very similar things with n-grams that we can with individual words. Let’s start with counting, TF-IDF, and stop words.
Counts and Stop Words
One of the things we can do with n-grams is count the number of times each n-gram appears in a text. Just like when we do this with words, this can help us determine what the text is all about.
To count the number of times each word appears, we use count on the column of words. We can use very similar code to count the number of times each bigram appears:
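A minimal sketch of both counts is given here. It assumes the bigram counts are stored as count_bigrams (the name used in Question 6); count_words is a name chosen for this sketch only. The tokenizing steps are repeated so that the block runs on its own.

```r
library(janeaustenr)
library(dplyr)
library(tidytext)

# Word counts: tokenize into single words, then count each word
# (count_words is a hypothetical name used only in this sketch)
count_words <- austen_books() |>
  unnest_tokens(word, text) |>
  count(word, sort = TRUE)

# Bigram counts: the same idea, but we count the bigram column
tidy_Austen_bigrams <- austen_books() |>
  unnest_tokens(bigram, text, token = "ngrams", n = 2) |>
  filter(!is.na(bigram))

count_bigrams <- tidy_Austen_bigrams |>
  count(bigram, sort = TRUE)
```

Using sort = TRUE puts the most frequent words or bigrams in the first rows.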
Question 6
Print out the 10 bigrams that occur most frequently by using
knitr::kable(count_bigrams[1:10, ]). What do you notice
about these bigrams?
Ah, stop words!! The answer to Question 6 is our first hint that some of the issues that we saw when we worked with words can also occur with n-grams. We are going to handle these issues in a very similar way, but the code will look a little different.
We know we can handle stop words by removing them, or by using a tool like the TF-IDF that effectively filters out the stop words by down weighting them. We will try both, but let’s start by thinking about how to remove stop words.
When we tokenized the text into words, we used anti_join
to remove all words that were stop words. Unfortunately, we can’t
just remove the stop words before we tokenize when we work with n-grams.
This is because removing stop words might create bigrams that were
not in the original text.
For example, consider the phrase “were not in the original text”. If we removed the stop words “were”, “in” and “the”, we would get “not original text”. This means that when we tokenized into bigrams, we would get “not original” and “original text”. The second bigram is one we had in the original phrase, but the first is not.
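The example above can be checked with a short sketch. The stop word list here is just the three words named in the example (toy_stop_words is an assumption for this illustration, not the tidytext stop_words data), and all object names are hypothetical.

```r
library(dplyr)
library(tidytext)

phrase <- tibble(text = "were not in the original text")

# Only the three stop words named in the example above
toy_stop_words <- c("were", "in", "the")

# Bigrams from the original phrase
original_bigrams <- phrase |>
  unnest_tokens(bigram, text, token = "ngrams", n = 2) |>
  pull(bigram)

# Remove the stop words FIRST, glue the remaining words back together,
# and only then tokenize into bigrams
stripped_bigrams <- phrase |>
  unnest_tokens(word, text) |>
  filter(!word %in% toy_stop_words) |>
  summarise(text = paste(word, collapse = " ")) |>
  unnest_tokens(bigram, text, token = "ngrams", n = 2) |>
  pull(bigram)

# Bigrams that appear only after removing stop words
new_bigrams <- setdiff(stripped_bigrams, original_bigrams)
new_bigrams
# "not original"
```

The bigram “not original” never appears in the original phrase, which is why we tokenize first and remove stop words afterwards.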
Instead of removing all the stop words before we tokenize, we are going to tokenize the text into bigrams first. We are then going to remove all of the bigrams that contain at least one stop word. This is going to require a few steps.
Step 1: We break up our bigrams into two words. The
command separate is going to help us do this. It requires
three inputs:
- what we want to break apart (the column
bigram)
- what new columns we want to create (
word1 and word2)
- how the computer should tell when to break apart the bigrams. Our
bigrams have a space in between each word, so a space (
" ") is where one word stops and another starts!
This means we will use this code:
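A minimal sketch of that call, assuming the result is stored as bigrams_separated (the name used in Question 7 and in the later code). The bigrams are rebuilt first so the block runs on its own.

```r
library(janeaustenr)
library(dplyr)
library(tidyr)
library(tidytext)

# Rebuild the bigrams so this sketch is self-contained
tidy_Austen_bigrams <- austen_books() |>
  unnest_tokens(bigram, text, token = "ngrams", n = 2) |>
  filter(!is.na(bigram))

# Break each bigram apart at the space into word1 and word2
bigrams_separated <- tidy_Austen_bigrams |>
  separate(bigram, c("word1", "word2"), sep = " ")
```

The bigram column is replaced by the two new columns, and the number of rows does not change.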
Question 7
What is the content of row 8 in bigrams_separated?
Question 8
Show the code you would need to break up the trigrams from Question 5 into three words.
Step 2: Once we have broken the bigrams into two
words, we can remove any rows where the first word or the
second word is a stop word. We know that we can use filter
to choose any rows that have specific qualities we want to keep. It
turns out we can also use filter to remove rows that we
don’t want! Here, the ! in the code tells R to exclude rows
that are stop words.
bigrams_clean <- bigrams_separated |>
  # Remove any rows where word 1 is a stop word
  filter(!word1 %in% stop_words$word) |>
  # Remove any rows where word 2 is a stop word
  filter(!word2 %in% stop_words$word)
Question 9
Show the code you would need to remove the stop words from the trigrams from Question 8.
At this point, we have removed all the bigrams that include stop
words. However, our bigrams are currently broken up into two individual
words. To actually use the bigrams for analysis, we have to put the
words back together again! We use unite for this.
The inputs are:
- bigram: the new column we want to create
- word1 and word2: the columns we want to combine
- sep = " ": include a space between our two words in the bigram when we put them together
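A minimal sketch of this step with those inputs. The name bigrams_united is an assumption made for this sketch; use whatever name fits your own work. The earlier steps are repeated in compact form so the block runs on its own.

```r
library(janeaustenr)
library(dplyr)
library(tidyr)
library(tidytext)

# Rebuilt here so the sketch is self-contained; in the lab you
# already have bigrams_clean from Step 2
bigrams_clean <- austen_books() |>
  unnest_tokens(bigram, text, token = "ngrams", n = 2) |>
  filter(!is.na(bigram)) |>
  separate(bigram, c("word1", "word2"), sep = " ") |>
  filter(!word1 %in% stop_words$word) |>
  filter(!word2 %in% stop_words$word)

# Put word1 and word2 back together, separated by a space
# (bigrams_united is a hypothetical name)
bigrams_united <- bigrams_clean |>
  unite(bigram, word1, word2, sep = " ")
```

After unite, the columns word1 and word2 are gone and each row again holds one two-word bigram.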
Question 10
Show the code you would need to re-unite the trigrams you split in Question 9.
So, we can remove stop words as we did with words, though it does take a little more work. At this point, the next step is to determine how often each of these bigrams occurs in the text.
Question 11
Create an object called bigrams_count that counts the
number of times each bigram occurs in the data set, now that we have
removed stop words. Print out the 10 bigrams that occur most frequently
(after removing stop words).
Question 12
Create a plot (geom_col) to show the ten most frequently
occurring trigrams after removing the stop words. Make
sure to label your axes.
Gender Analysis
So far, we have seen that we can tokenize the text into n-grams (bigrams and trigrams so far). We can also remove stop words and count the number of n-grams that occur most frequently. These are all things we could also do when we tokenized text into words.
However, there are things we can do with n-grams that we could not do with single words. For instance, what if we were curious what actions in Jane Austen’s books were associated with men vs. women? In other words, what did “he” versus “she” do in her books? This analysis was performed by Julia Silge on her blog, and we are going to replicate it!
In Jane Austen’s time, “she” and “he” were the pronouns primarily used to designate women and men. If we want to compare male and female actions, we are looking for things like “she said” or “he did”. This means we need to find all the bigrams whose first word is either “he” or “she”.
The first step is to store the key words we are looking for.
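A minimal sketch of this step, assuming the key words are stored in a character vector called pronouns (the name used by the filter code in this section):

```r
# The two key words we want to find at the start of a bigram
pronouns <- c("he", "she")
```

Bigrams are stored in lower case, so the key words are written in lower case as well.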
We have already seen that we can break up our bigrams into two words,
as we just did when we created bigrams_separated before
removing the stop words. However, in this case instead of removing all
bigrams that contain stop words, we want to keep all bigrams that start
with “he” or “she”.
To do this, we can use the following:
she_and_he_bigrams <- bigrams_separated |>
  # Keep only bigrams where the first word
  # is she or he
  filter(word1 %in% pronouns)
Notice that we are working with bigrams_separated, which
means we have not removed stop words. This is important here because
“she” and “he” are considered stop words.
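A quick way to check this is to look the two pronouns up in the stop_words data set that comes with tidytext. The vector is defined again here so the sketch runs on its own.

```r
library(tidytext)

pronouns <- c("he", "she")

# TRUE for each pronoun that appears in the stop word list
pronouns %in% stop_words$word
```

Both values come back TRUE, so removing stop words first would have removed every bigram we want to study.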
We can add one more line to count the number of times each bigram appears.
she_and_he_bigrams_count <- bigrams_separated |>
  # Keep only bigrams where the first word
  # is she or he
  filter(word1 %in% pronouns) |>
  # Count the number of times each bigram appears
  count(word1, word2, sort = TRUE)
Question 13
Using the code above (so don’t remove stop words), how many unique bigrams start with “he”?
Question 14
Which bigram starting with “she” happens most often?
Question 15
Which bigram starting with “he” happens most often?
At this point, we would really like to see which words are associated more often with “she” than “he” in bigrams in Jane Austen, and vice versa. This allows us to describe things that tend to be associated with women vs. men in her books.
To do this, we are going to look only at the word2
column (the 2nd word in the bigram). We are then going to find the
log-ratio of she vs. he for each word. This means we
are going to compute:
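\[\log_2\left( \frac{\text{Number of Times this word comes after she}}{\text{Number of Times this word comes after he}}\right) \]
We use the log because this fraction can get very small, and using logs makes small numbers easier to work with. The log also puts the comparison on a scale centered at 0: when the two counts are equal, the fraction is 1 and the log-ratio is 0.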
Question 16
Suppose the log ratio is negative. Does this mean that the word is more often associated with “she” or “he”? Explain briefly.
Question 17
Suppose the log ratio is positive. Does this mean that the word is more often associated with “she” or “he”? Explain briefly.
As with TF-IDF, we will not remove stop words when we work with log-ratios. The idea is that stop words are likely to follow both “she” and “he” about equally often, so their log-ratios should be close to 0.
To find the log-ratio in R, we can use the following.
word_ratios <- she_and_he_bigrams_count |>
  # Keep words that occur more than 10 times in total after he or she
  group_by(word2) |>
  filter(sum(n) > 10) |>
  ungroup() |>
  # One row per word, with one column of counts for he and one for she
  pivot_wider(names_from = word1, values_from = n, values_fill = 0) |>
  # Add 1 to every count, then turn each column into proportions
  mutate(across(where(is.numeric), \(x) (x + 1) / sum(x + 1))) |>
  # Compute the log ratio
  mutate(logratio = log2(she / he)) |>
  # Sort
  arrange(desc(logratio))
Note: This code is specific to this analysis. I’m happy to talk through what this code means, but it’s also fine not to worry about it.
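For those who are curious, the quantity the code computes for a word \(w\) can be written out as below. The symbols \(n_{\text{she}}(w)\) and \(n_{\text{he}}(w)\) are notation introduced here (they are not used elsewhere in the lab) for the number of times \(w\) comes after “she” and after “he”, and the sums run over all the words \(v\) that are kept by the filter step.

\[
\text{logratio}(w) = \log_2\left( \frac{\dfrac{n_{\text{she}}(w) + 1}{\sum_{v}\left(n_{\text{she}}(v) + 1\right)}}{\dfrac{n_{\text{he}}(w) + 1}{\sum_{v}\left(n_{\text{he}}(v) + 1\right)}} \right)
\]

Adding 1 to every count means a word that never follows one of the pronouns does not produce a division by 0 or the log of 0. Dividing by the column totals turns the counts into proportions, so the comparison is not affected by one pronoun simply appearing more often than the other.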
Question 18
Based on the log-ratios, what word is most likely to come after “she” rather than “he”?
Question 19
Based on the log-ratios, what word is most likely to come after “he” rather than “she”?
We can also create a graph of the log ratios to visualize this. Let’s look at the top 15 words most likely to be associated with “she”, and the top 15 words most likely to be associated with “he”.
Note: Yep, this is a lot of code. It includes some fun formatting to make the graph look cool, but you are not responsible for knowing how all of this works! As long as you can change the labels, you are set.
word_ratios |>
  mutate(abslogratio = abs(logratio)) |>
  group_by(logratio < 0) |>
  slice_max(abslogratio, n = 15) |>
  ungroup() |>
  mutate(word = reorder(word2, logratio)) |>
  ggplot(aes(word, logratio, color = logratio < 0)) +
  # linewidth replaces size for lines in ggplot2 3.4.0 and later
  geom_segment(aes(x = word, xend = word,
                   y = 0, yend = logratio),
               linewidth = 1.1, alpha = 0.6) +
  geom_point(size = 3.5) +
  coord_flip() +
  labs(x = NULL,
       y = "Relative appearance after 'she' compared to 'he'",
       title = "Words paired with 'he' and 'she' in Jane Austen's novels") +
  scale_color_discrete(name = "", labels = c("More 'she'", "More 'he'")) +
  scale_y_continuous(breaks = seq(-3, 3),
                     labels = c("0.125x", "0.25x", "0.5x",
                                "Same", "2x", "4x", "8x"))
Question 20
Based on the plot above, describe what words seem to be associated more with “she” and what words seem to be more associated with “he” in Jane Austen’s novels.
TF-IDF
We have seen that we can use log-ratios to compare how often a second word in a bigram occurs for a certain first word, like “she” or “he”. If we want to see which bigrams occur most often in each book, we can also use the TF-IDF.
For instance, to find the TF-IDF of each bigram in each book, we use:
bigram_tfidf <- tidy_Austen_bigrams |>
  # Count the number of times each bigram appears in each book
  count(book, bigram) |>
  # Compute the TF-IDF for each bigram in each book
  bind_tf_idf(bigram, book, n) |>
  # Arrange them in descending order
  arrange(desc(tf_idf))
We can then make a plot to display the top 10 bigrams per book in terms of TF-IDF.
bigram_tfidf_top10 <- bigram_tfidf |>
  # Group by book
  group_by(book) |>
  # Choose the top 10 bigrams within each book
  # by TF-IDF
  slice_max(tf_idf, n = 10) |>
  ungroup()
# Make the plot
ggplot(bigram_tfidf_top10, aes(tf_idf, fct_reorder(bigram, tf_idf), fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free") +
  labs(x = "tf-idf", y = NULL)
You will notice here that there are some stop words in the bigrams. Unlike TF-IDF for single words, TF-IDF for bigrams can still leave bigrams that contain stop words among the top few for a book.
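A small sketch can show why this happens. The counts below are made up for illustration (they are not taken from the novels), and toy_counts and toy_tfidf are hypothetical names.

```r
library(dplyr)
library(tidytext)

# Made-up bigram counts for two books
toy_counts <- tibble(
  book   = c("Book A", "Book A", "Book B", "Book B"),
  bigram = c("of the", "to elinor", "of the", "to emma"),
  n      = c(50, 5, 40, 4)
)

toy_tfidf <- toy_counts |>
  bind_tf_idf(bigram, book, n)

toy_tfidf
```

The bigram “of the” appears in both books, so its idf is \(\ln(2/2) = 0\) and its TF-IDF is 0. The bigrams “to elinor” and “to emma” each appear in only one book, so they get a positive TF-IDF even though they contain the stop word “to”.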
References
Data
Silge J (2022). janeaustenr: Jane Austen’s Complete Novels. R package version 1.0.0, https://CRAN.R-project.org/package=janeaustenr.
Code
This analysis and code were adapted from Chapter 4 of “Text Mining with R: A Tidy Approach”, written by Julia Silge and David Robinson (https://www.tidytextmining.com/ngrams). The book was last built on 2024-06-20.
Gender Analysis
The gender analysis and code were adapted from a blog post by Julia Silge. Citation: Silge, Julia. “Gender Roles with Text Mining and N-grams”. Created April 15, 2017. URL: https://juliasilge.com/blog/gender-pronouns/