STA 279 Lab 7
Complete all Questions.
The Goal
In class, we have been learning about topic modeling. Today, we are going to try out topic modeling in R. We are going to do this with a data set where we know what the topics are so we can see how well the model is able to find those topics. We’ll also see how to visualize the output of topic modeling so we can actually use the results.
We will need the following libraries:
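The original chunk is not reproduced here; based on the functions used later in this lab, a typical setup would look something like this (the exact set of packages is an assumption):

# Packages used in this lab (a sketch of a typical setup; install any that are missing first)
library(tidyverse)    # data wrangling and plotting (dplyr, ggplot2, readr, ...)
library(tidytext)     # unnest_tokens(), stop_words, cast_dtm(), tidy() methods, reorder_within()
library(topicmodels)  # LDA()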
The Data Set
We will be working with a data set called bookchapters
containing 39870 lines of text. These lines of text belong to 109
different chapters, and we will consider each chapter a document today.
This means \(D = 109\).
bookchapters <- read.csv("https://www.dropbox.com/scl/fi/893xtsy1n3f1c97kqj3dk/bookchapters.csv?rlkey=vwfebejc0wqm61vb1020z5vkm&st=xzek55zn&dl=1")
Note: For those into literature, I will note that some of the books were divided into scenes, some into books, and some into chapters. For the purpose of this analysis, we treated all of those like chapters.
Modeling Goals
Our data come from 4 different books. For our analysis today, we will be using topic modeling to try to identify which book each chapter \(d\) comes from. This means that \(T=4\). For each document \(d\), we will then be estimating
\[\gamma_d = \left( \gamma_{1(d)} ,\gamma_{2(d)}, \gamma_{3(d)} , \gamma_{4(d)} \right)\]
The hope is that the model will place the highest probability \(\gamma_{t(d)}\) on the topic \(t\) that represents the correct book.
Question 1
Remind me, if we let \(d = 1\), what does \(\gamma_d\) tell us? In other words, explain to me in words what \(\gamma_1\) represents.
We estimate \(\gamma_d\) for each document \(d\) as part of LDA. We will also be estimating \(\beta_t\) for each topic \(t= 1, 2, 3, 4\).
\[\beta_t = \left( \beta_{1(t)} ,\beta_{2(t)}, \dots, \beta_{W(t)} \right)\]
Here, \(W\) represents the number of unique words in the data set after removing stop words.
Question 2
Remind me: if we let \(t = 1\), what does \(\beta_t\) tell us? In other words, explain to me in words what \(\beta_1\) represents.
To know how long the vector \(\beta_t\) will be for our data, we need to find \(W\).
Question 3
Tokenize the text in bookchapters
into words without
removing the stop words. State how many unique words are in the data
set.
Question 4
Tokenize the text in bookchapters
into words and remove
the stop words. Store the results in an object called
bookchapters_words
.
(a) State how many unique words (after removing stop words) are in the data set.
(b) State how many words you were able to remove from our vocabulary when you removed the stop words. In other words, how much smaller is your answer in (a) than your answer from Question 3?
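If you would like a starting point for Question 4, here is a sketch of one common tidytext approach. It assumes the column of bookchapters holding the raw text is named text; check names(bookchapters) and adjust if your column is named differently.

# A sketch for Question 4 (assumes the text column is named "text")
bookchapters_words <- bookchapters |>
  unnest_tokens(word, text) |>          # tokenize into one word per row
  anti_join(stop_words, by = "word")    # remove stop words

# Number of unique words that remain
bookchapters_words |>
  distinct(word) |>
  nrow()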
When we do topic modeling, removing stop words is important for two reasons. The first is that if we do not remove stop words, they sometimes become a topic of their own in the modeling process. As stop words do not provide us with content information about the books, this isn’t helpful.
The second reason we remove the stop words is more of a practical one. In topic modeling, we have to model the probability of seeing each unique word in the data set for each topic. As we can see in Question 4, removing the stop words decreases the number of words we have to model, which helps the model converge faster and take less time to run.
So, our goals: For each of the 4 topics we are going to estimate the probability vector \(\beta_t\).
\[\beta_t = \left( \beta_{1(t)} ,\beta_{2(t)}, \dots, \beta_{17363(t)} \right)\]
For each of the \(D = 109\) documents, we are going to estimate the vector \(\gamma_d\).
\[\gamma_d = \left( \gamma_{1(d)} ,\gamma_{2(d)}, \gamma_{3(d)} , \gamma_{4(d)} \right)\]
When we are done, we are going to look at the \(\gamma_d\) vector for each document \(d\) and see if the highest probability is associated with the correct book.
Now that we know our goals, let’s do this in R.
Modeling
Before we can use LDA to model the topics in the data, we have to reformat the data a little. The code we will be using requires a specific type of data input for the code to run.
The first step is to count the number of times that each word occurs
in each document. We then convert the data into what is called a
document-term matrix. The final result is an object
called chapters_dtm
that we will use for modeling.
chapters_dtm <- bookchapters_words |>
count(document, word, sort = TRUE) |>
cast_dtm(document, word, n)
Basically, this is a structure that is convenient for the computer to work with as it goes through the updating process necessary for LDA. It is not at all convenient for us to work with otherwise, but for LDA we need to use it.
Though the process of LDA has many steps, the code to run it requires only one line:
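The chunk itself is not shown here; based on the inputs described below, that single line would look roughly like this (the object name chapters_lda is an assumption):

# Fit the LDA model with T = 4 topics
chapters_lda <- LDA(chapters_dtm, k = 4, control = list(seed = 1234))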
This code has three inputs:

chapters_dtm: this is our formatted data.
k = 4: this is \(T\), the number of topics.
control = list(seed = 1234): this sets a random seed of 1234.
What is a random seed??
Random Seeds
As part of LDA, we know that we find the probability that each word \(i\) in a document \(d\) comes from a certain topic. For instance, we could have:
\[\gamma_1 = \left( .2, .1, .6, .1 \right)\]
This means that there is a 20% probability that a word \(i\) in document 1 comes from Topic 1, a 10% probability it comes from Topic 2, a 60% probability it comes from Topic 3, and a 10% probability it comes from Topic 4.
Once we find these probabilities, part of the model involves sampling a topic \(t_{i1}\) for each word \(i\) in document 1 using these probabilities. We can do that using this code:
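That chunk is not reproduced here, but using the example probabilities in \(\gamma_1\) above, it would look roughly like this:

# Sample one topic for a word in document 1, using the gamma_1 probabilities above
sample(1:4, 1, prob = c(.2, .1, .6, .1))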
This function has three inputs:

1:4: this tells the computer what our options are for sampling. In this case, it is the topics 1-4.
1: this tells the computer how many topics we want to sample. Each word \(i\) can have only one topic, so we choose 1.
prob =: this is a vector that tells the computer the probability of sampling each of the topics.
When you run the sample code above, you get a number 1, 2, 3, or 4.
Great…however, suppose you close your Markdown file and come back to it
later. We want the computer to choose the SAME topic when you run your
code again. Otherwise, you would get completely different results each
time you drew a random sample! The job of the set.seed()
function is to do just that. It ensures that each time you run your
chunk, you will get the same random sample. Let’s try that.
Question 5
Create a code chunk, paste in this code, and press play:
sample( 1:4, 1, prob = c(.2,.3,.3,.2))
What number do you get? In other words, what topic did you assign to the word?
Now, hit play again. What number did you get now?
You should notice that every time you hit play on this chunk, the result can change. This is what would happen if you closed your Markdown and re-opened it, or if you gave your code to someone else to run. This is not something we want to have happen, so we set a random seed to fix this problem.
Question 6
Now, add the line set.seed(279)
to the beginning of your
code chunk from the previous question (meaning this line needs to come
before the sample
command.) Hit play on the chunk.
What number did you get?
Now, hit play again. What number did you get now?
Note: You will note that I used 279, but in practice you can use literally any positive integer you want as your random seed.
Now we notice that no matter how often we run the chunk, we get the same values. Yes!! This means that we can close our R and come back to it later, and our results will not change. This also means that we can send our code to someone else, and they will get the same random sample that we did. This means that setting a seed can help make your code reproducible.
Setting a random seed (which is what we will call using the
set.seed()
command) will prove very useful for any kind of
random sampling we do in this course.
Back to the model: Word Probabilities
Recall that we were talking about the one line of code we need to run LDA before we got onto the tangent about random seeds:
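As a reminder, that line looked roughly like this (again assuming the fitted model was stored as chapters_lda):

chapters_lda <- LDA(chapters_dtm, k = 4, control = list(seed = 1234))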
There are two things we want to retrieve from this output: our estimates of \(\gamma\) and our estimates of \(\beta\). We will start with \(\beta\).
In order to retrieve the estimates of \(\beta\) from LDA, we can use the following:
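The chunk is not shown here; one standard way to do this is with tidytext's tidy() method for LDA models (a sketch, using the name beta_estimates because that is how the object is referred to below):

# One row per (topic, term) pair, with the estimated probability beta
beta_estimates <- tidy(chapters_lda, matrix = "beta")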
Question 7
Look at the first 4 rows in beta_estimates
. Interpret
each of the 4 \(\beta\) values you
see.
Hint: 1e-03 = .001
Question 8
How many times larger is the probability that the word "people" will occur in Topic 3 versus Topic 1?
Question 9
You will notice that these 4 probabilities do not add up to one, nor do they have to! Explain why these 4 probabilities do not have to add up to one, even though we have 4 topics.
The \(\beta_t\) terms show us the probability of getting each of the \(W\) words in documents from topic \(t\). This means that one way to determine the words that make up topic \(t\) is to look at the words that have the highest probability. In other words, we look at the words in \(\beta_t\) that have the highest probability for each topic \(t\).
To do this, we can use code that should be very familiar to us by now.
top_terms <- beta_estimates |>
group_by(topic) |>
slice_max(beta, n = 10) |>
ungroup() |>
# Order the data in decreasing beta
# order within each topic
arrange(topic, -beta)
Question 10
Annotate the code in the chunk above. In other words, briefly explain in words what each line of the code above does. I’ve done the final line for you!
Question 11
What word has the highest probability of occurring in Topic 3?
Another way we can visualize the terms defining each topic is by using a graph. Again, this code should be familiar.
# Arrange the data so facet wrap keeps them in order
top_terms_plot <- top_terms |>
mutate(term = reorder_within(term, beta, topic))
# Create the plot
ggplot(top_terms_plot, aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
scale_y_reordered()
Question 12
What book do you think the terms in Topic 4 come from?
Topic Probability Estimates
At this point, we can see the words that seem to define each topic. We hope each of these topics relates to one of the 4 books in our data set. The next step is to see if the model was able to accurately separate the chapters into books. In other words, did the model actually create topics for each book, and did it do so correctly?
To determine all of this, we need to switch gears from estimating \(\beta\) to estimating \(\gamma\).
Question 13
Suppose a chapter of text actually came from the book Emma. Based on the plot in the previous section, what would an example of a reasonable estimate of \(\gamma_d\) look like for this chapter?
In order to retrieve the estimates of \(\gamma\), we can use the following:
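As with \(\beta\), the chunk is not shown here; a sketch using tidytext's tidy() method (the name gamma_estimates matches how the object is used below):

# One row per (document, topic) pair, with the estimated probability gamma
gamma_estimates <- tidy(chapters_lda, matrix = "gamma")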
Let’s look at the first document, which is the first chapter in the book The War of the Worlds.
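One way to pull out the \(\gamma\) estimates for that document (a sketch; the document label The War of the Worlds_1 is described below):

gamma_estimates |>
  filter(document == "The War of the Worlds_1") |>
  arrange(topic)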
Question 14
Which topic has the highest probability for this document? Look back at the plot above Question 12 that shows the words that are related to this topic. Does it look like this topic is indeed related to the correct book?
Note: The probabilities are rounded, so a probability of 1 is not exactly 1.
Okay, great!! This means that for each document, we can take a look and see if the highest probability in \(\gamma_d\) is the correct book!
There is only one small problem right now, and it has to do with how the documents are named in the document column. Right now, you’ll notice the first document is named The War of the Worlds_1. This is because this document is the 1st chapter in the book The War of the Worlds.
Our task is going to be to see if the model correctly identified the books the chapter came from. This means it would be helpful to have just the book title. To clean up the document names, we can use the following:
gamma_estimates <- gamma_estimates |>
separate(document, c("title", "chapter"), sep = "_", convert = TRUE)
This splits the document column into two new columns: title, which indicates which book a document came from, and chapter, which gives the chapter number within that book.
There are a few ways we can go about seeing how well our model was able to separate documents into books. One option is to look at all the documents (chapters) in each book and find the \(\gamma\) vector for each one. We can then take the average of all these vectors to determine which topics the documents tended to be placed in.
Again, recall that the hope is that the \(\gamma_d\) vector will place most of the probability on the topic that is actually associated with the correct book.
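Here is a sketch of one way to compute those average vectors, assuming gamma_estimates has already been separated into title and chapter as shown above:

# Average gamma for each (book, topic) combination
gamma_estimates |>
  group_by(title, topic) |>
  summarize(mean_gamma = mean(gamma), .groups = "drop") |>
  arrange(title, topic)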
Question 15
What is the average \(\gamma_d\) vector for Hamlet? Does this indicate that the model was generally able to correctly identify chapters as belonging to Hamlet?
We can also visualize the information in a graph.
gamma_estimates |>
mutate(title = reorder(title, gamma * topic)) |>
ggplot(aes(factor(topic), gamma)) +
geom_boxplot() +
facet_wrap(~ title) +
labs(x = "topic", y = expression(gamma))
What we can see from the graph and the table output is that for each book, the highest probability is associated with a different topic. This means that the model was able to correctly separate the chapters into the 4 different books!
Choosing T
Now, all of this relied on us knowing that there were 4 topics (4 books), so we could correctly choose \(T=4\). In practice, there are a few different tools that we can use to choose \(T\). One possibility is to choose a large \(T\), like \(T\) = 10. Once we see the words defining the topics, we may be able to see that some topics can be combined, which means we can train the model again using a lower choice of \(T\). There are also versions of LDA that add an extra layer to the model that estimates \(T\). There are a lot of choices depending on our modeling goals!
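For instance, one quick exploration (a sketch; the object name and seed are arbitrary) would be to refit the model with a larger k and inspect the top terms, reusing the code from earlier in the lab:

# Refit with a deliberately large number of topics, then inspect the top terms
chapters_lda_10 <- LDA(chapters_dtm, k = 10, control = list(seed = 1234))
tidy(chapters_lda_10, matrix = "beta") |>
  group_by(topic) |>
  slice_max(beta, n = 10) |>
  ungroup() |>
  arrange(topic, -beta)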
References
Data
Johnston M, Robinson D (2023). gutenbergr: Download and Process Public Domain Works from Project Gutenberg. R package version 0.2.4, https://CRAN.R-project.org/package=gutenbergr.
Code
This analysis and code was adapted from Chapter 6 of “Text Mining with R: A Tidy Approach”, written by Julia Silge and David Robinson (https://www.tidytextmining.com/topicmodeling). The book was last built on 2024-06-20.