In the last module, we used clustering methods to assign each document in a collection to a single cluster of, allegedly, similar documents.
But in clustering, each document can belong to only one cluster, so there are limits to this analysis.
What would you do with documents that you know contain multiple topics, when what you want is a sense of which topics each one covers?
This is not really a “clustering” problem, but rather a content analysis problem.
Example: political platforms - what you might want is not which platforms are most similar, but rather a summary of the topics each platform covers, and a list of all topics across all platforms.
Topic models represent documents as mixtures of topics, with each document composed of varying proportions of topics.
This allows you to find “themes” in a collection of documents that you might not be able to pull meaning out of easily by hand.
Feed in 100 documents, indicate that you believe there are 4 topics
Algorithm will kick back to you words that are most associated with each of the 4 topics
Algorithm will kick back to you, for each document, the proportions of topics per document
So, let’s say you gave the algorithm 100 documents: 25 recipes, 25 constitutions, 25 albums of Bruce Springsteen song lyrics, and 25 R programming books
The topic modeling tool should generate the most common words associated with cooking, constitutions, Bruce albums, and R programming
It would also kick back, for each document, the proportion devoted to each topic (presumably, most of each document would be assigned to one of the four topics, because these are very different topics with different vocabularies)
As we will see, this is pretty much never how it works in practice
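For reference, here is a minimal sketch of what fitting such a model looks like in R. It assumes you already have a document-term matrix called dtm (the object name and K = 4 are illustrative), and uses the topicmodels package as one common implementation:

```r
# A minimal fit: 4 topics on an existing document-term matrix.
# `dtm` is assumed to be a DocumentTermMatrix you have already built
# (e.g., with tidytext::cast_dtm()); the seed just makes results repeatable.
library(topicmodels)

lda_fit <- LDA(dtm, k = 4, control = list(seed = 1234))
```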
Topic modeling works using the method of latent Dirichlet allocation (LDA).
Mathematically a little complicated and we won’t get into the specifics, but it builds on the multinomial models we introduced earlier in the course
You construct documents from “bags of bags of words”
When an author drafts a document, she first draws from a set of topics to include in the document. The possible topics have a set of probabilities/weights associated with them.
Each topic has a probability distribution over all possible words associated with it.
The word “elections” is more likely if I am writing about politics than bio-engineering.
LDA imagines that we generate words one after another, first picking a topic for each word, and then picking a word associated with that topic.
We go word by word until we have an entire document.
This would be a crazy way to write, but it seems to work ok-ish as a generative model for how topics guide the words we find in documents.
If you are inclined, you can go back and think about this more rigorously from the multinomial model of language framework.
Topics and words are both drawn from respective multinomial distributions.
You assume a Dirichlet prior distribution for the topic and word weights (a draw from a Dirichlet is a bunch of probabilities that sum to 1).
The details of this are out of scope for us but well understood by the experts.
The short story is that you can reverse engineer the model's parameters by inference from the observed distribution of words in your documents.
Those parameters constitute the probabilities of words belonging to topics and the proportions of topics in documents.
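To make the generative story concrete, here is a toy simulation sketch. The topics, vocabulary, and all of the numbers are made up for illustration, and gtools::rdirichlet() is just one convenient way to draw from a Dirichlet:

```r
# Toy simulation of the LDA generative story; every number here is made up.
set.seed(42)

vocab  <- c("election", "vote", "senate", "cell", "gene", "protein")
topics <- list(
  politics = c(0.40, 0.35, 0.20, 0.02, 0.02, 0.01),  # word weights, topic 1
  biology  = c(0.01, 0.02, 0.02, 0.35, 0.30, 0.30)   # word weights, topic 2
)

# Document-level topic proportions drawn from a Dirichlet prior
# (gtools::rdirichlet() is one of several places this function lives).
doc_topic_props <- gtools::rdirichlet(1, alpha = c(1, 1))[1, ]

# Build a 20-word "document": pick a topic for each word, then a word from it.
doc <- replicate(20, {
  z <- sample(1:2, size = 1, prob = doc_topic_props)  # topic for this word
  sample(vocab, size = 1, prob = topics[[z]])         # word from that topic
})
doc
```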
It does not tell you how many topics are in a corpus. You as the analyst have to make a choice about how many topics to model.
It also does not tell you what those topics are, substantively. You have to interpret that.
Topic models provide you with a measure of how associated a given topic is with each document (the “gamma” values).
Topic models give you a list of words and how associated they are with each topic (the “beta” values).
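If you fit the model with the topicmodels package, the tidytext package will hand you both of these as tidy data frames. A sketch, assuming a fitted model object lda_fit like the one above:

```r
# Tidy the fitted model into per-topic word probabilities ("beta") and
# per-document topic proportions ("gamma").
library(tidytext)
library(dplyr)

word_topics <- tidy(lda_fit, matrix = "beta")   # columns: topic, term, beta
doc_topics  <- tidy(lda_fit, matrix = "gamma")  # columns: document, topic, gamma

# A common first look: the top 10 words for each topic.
word_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  arrange(topic, desc(beta))
```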
One outstanding issue is that, like with clustering, it is incumbent upon the analyst to pick at the start how many topics the algorithm should try to organize the documents into. This is a high stakes decision, and there is really no guidance on how to do this best.
You just have to validate your results through close inspection.
Often, we know something about the documents - subject matter, date of publication, author characteristics, etc. Can we use that information in the model?
Indeed, there are ways to do this. Essentially, these methods let document-level covariates structure topic prevalence in documents.
This is called structural topic modeling and is implemented by the package stm in R.
The details of this are somewhat out of scope. If you think you have relevant document-level covariates, this might be worth investigating, but “unstructured” topic modeling is fine as you are getting to know this method in this class.
Can you run this on very short documents? Not really. You need documents with enough words that they can meaningfully inform the parameter estimates.
So applying this to individual tweets would not work so well.
You could aggregate up and look at all posts/writing from a user within a certain time period, for instance.
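That aggregation step is ordinary data wrangling. A quick sketch with dplyr, assuming a data frame tweets with user, date, and text columns (all names illustrative):

```r
# Pool short posts into one "document" per user per month.
library(dplyr)

user_docs <- tweets %>%
  mutate(month = format(as.Date(date), "%Y-%m")) %>%
  group_by(user, month) %>%
  summarise(text = paste(text, collapse = " "), .groups = "drop")
```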
(Look at the .qmd)
STM is a little more complex in terms of coding, because it relies on quanteda.
The vignette is good (https://cran.rstudio.com/web//packages/stm/vignettes/stmVignette.pdf) - the first 50% or so is easy to understand, then it gets a little harder.
The goal of this modeling endeavor is to do topic modeling on a number of political blog entries.
Useful metadata about these blogs includes the political ideology of the blogger (which is known) and the date of publication of the post. These could well have systematic effects on what topics are included in the blog post.
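The out$documents, out$vocab, and out$meta objects in the stm() call below come from stm's own preprocessing helpers. Roughly, following the vignette's workflow (the data frame and column names here are the vignette's, not something to assume for your own data):

```r
# Build the documents/vocab/meta objects stm() expects, following the
# vignette: `data` is a data frame with a `documents` text column plus
# `rating` and `day` metadata columns (from the vignette's poliblog data).
library(stm)

processed <- textProcessor(data$documents, metadata = data)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)
```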
```r
poliblogPrevFit <- stm(documents = out$documents, vocab = out$vocab,
                       K = 20, prevalence = ~ rating + s(day),
                       max.em.its = 75, data = out$meta,
                       init.type = "Spectral")
```
Prevalence is a function of rating plus day: this tells the model to incorporate these variables into the document-level topic prevalence parameter estimation (the gamma values).
Topic models are a helpful form of unsupervised learning that can help you get a sense of the content of documents.
It still requires a lot of validation.
Clustering and LDA are basically ways to group documents, and then you as the user can figure out what the clusters/topics are about.
But, clustering is in many dimensions, and LDA can include many topics.
What if you want a way to simplify your understanding of documents?
For instance, what if you wanted to place documents on something like a left-right political ideology scale?
Dimensionality reduction is a statistical technique for “compressing” data contained in many covariates, while retaining as much distinguishing information as possible.
You often read about this in the context of ML model building, where you might have hundreds or thousands of related covariates, and you might get better model performance if you compress those super wide data sets into a smaller number of summary scores.
This is covered in greater detail in the machine learning course.
Let’s “map” multidimensional data onto a lower-dimensional “component”.
This means finding, for each data point, the point on a line that is closest to it - projecting the point onto the line at a 90 degree angle.
PCA functions will find these lines for you and tell you the relative position of each point on this line, which becomes a “principal component score”
PCA also gives you principal component loadings, which in this context tell you how each word is associated with the latent variable (for instance, if the component ran from conservative to liberal, a word with a high loading would be a “liberal” word)
Clean your data (remove punctuation, stopwords, etc.).
Create a DTM.
Center and rescale! Very important and included in all tutorials.
Use an R/Python package to calculate PCs.
Assign a “scale” position to documents using a document’s PC score.
Interpret what the PC means? (next slide)
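A minimal sketch of these steps in R with base prcomp(), assuming a cleaned document-term matrix dtm (the object name is illustrative):

```r
# PCA on a document-term matrix: center/rescale, then pull out component
# scores (for documents) and loadings (for words).
dtm_mat <- as.matrix(dtm)

# Drop any term that never varies across documents; prcomp() cannot
# rescale a zero-variance column.
dtm_mat <- dtm_mat[, apply(dtm_mat, 2, sd) > 0]

pca_fit <- prcomp(dtm_mat, center = TRUE, scale. = TRUE)

doc_scores    <- pca_fit$x[, 1]         # each document's position on PC1
word_loadings <- pca_fit$rotation[, 1]  # how each word loads on PC1

# Words at the two extremes of PC1 help you interpret what it means.
head(sort(word_loadings), 10)                     # most negative loadings
head(sort(word_loadings, decreasing = TRUE), 10)  # most positive loadings
```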
You still have to interpret what the components mean, which puts a lot of onus on the researcher.
A common approach would be to look at PC loadings for certain words - those words with very positive/very negative loading would give you a sense of the scale meaning.
You could also sample documents with very positive/very negative PC scores.
The problem continues to be that you have to do all of this yourself. Validate!
You could get document embeddings from LLMs and then use PCA on those embeddings.
There are some political science-specific applications of scaling related to ideology (look up wordscores if you are interested, Benoit and co-authors).
There is emerging research about how to use LLMs for this task too.
What if you established some “ground truth” examples of a spectrum you believe is present in your documents, and you wanted to compare a set of new documents to your ground truth?
You might be able to prompt an LLM to do this, or you could do something like cosine similarity with the pre-trained embeddings from one of the big models like OpenAI.
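A sketch of the cosine similarity idea, assuming you have already retrieved embedding vectors (as plain numeric vectors) from whatever embedding model you use; nothing here is tied to a particular API:

```r
# Cosine similarity between two embedding vectors (assumed pre-computed).
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# e.g., compare a new document's embedding to "ground truth" anchors at
# either end of your spectrum (these objects are hypothetical):
# cosine_sim(doc_embedding, left_anchor_embedding)
# cosine_sim(doc_embedding, right_anchor_embedding)
```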
I’m not an expert on this, but I suspect this is going to be more of the wave of the future.
BERTopic by Grootendorst
(check the .qmd for many links hidden in the code block)
Special acknowledgment to this excellent introductory blog from Kevin Reuning, with a version of the bullets from the last slide.
But be warned, even Grootendorst will advise that this isn’t necessarily “better” than LDA. It just depends on your use case.
This is a Python-native package, but there is an R wrapper (included in the hidden links and on the resources page).
We’re done with unsupervised learning. Next up, sentiment analysis and using LLMs for NLP tasks.