In the last module, we used clustering methods to assign each document in a collection to a single cluster of, allegedly, similar documents.
But in clustering, each document can belong to only one cluster, so there are limits to this analysis.
What would you do with documents that you know contain multiple topics, when what you want is a sense of which topics each one covers?
This is not really a “clustering” problem, but rather a content analysis problem.
Example: political platforms - what you might want is not which platforms are most similar, but rather a summary of the topics each platform covers, and a list of all topics across all platforms.
Topic models represent documents as mixtures of topics, with each document composed of varying proportions of topics.
This allows you to find “themes” in a collection of documents that you might not be able to pull meaning out of easily by hand.
Feed in 100 documents, indicate that you believe there are 4 topics
Algorithm will kick back to you words that are most associated with each of the 4 topics
Algorithm will kick back to you, for each document, the proportions of topics per document
So, let’s say you gave the algorithm 100 documents: 25 recipes, 25 constitutions, 25 albums of Bruce Springsteen song lyrics, and 25 R programming books
The topic modeling tool should generate the most common words associated with cooking, constitutions, Bruce albums, and R programming
It would also kick back, for each document, the proportion devoted to each topic (presumably, most of each document would be assigned to one of the four topics, because these are very different topics with different vocabularies)
As we will see, this is pretty much never how it works in practice
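For reference, here is a minimal sketch of what fitting such a model looks like in R. It assumes you already have a document-term matrix called dtm (the object name and K = 4 are illustrative), and uses the topicmodels package as one common implementation:

```r
# A minimal fit: 4 topics on an existing document-term matrix.
# `dtm` is assumed to be a DocumentTermMatrix you have already built
# (e.g., with tidytext::cast_dtm()); the seed just makes results repeatable.
library(topicmodels)

lda_fit <- LDA(dtm, k = 4, control = list(seed = 1234))
```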
Topic modeling works using the method of latent Dirichlet allocation (LDA).
Mathematically a little complicated and we won’t get into the specifics, but it builds on the multinomial models we introduced earlier in the course
You construct documents from “bags of bags of words”
When an author drafts a document, she first draws from a set of topics to include in the document. The possible topics have a set of probabilities/weights associated with them.
Each topic has a probability distribution over all possible words associated with it.
The word “elections” is more likely if I am writing about politics than bio-engineering.
LDA imagines that we generate words one after another, first picking a topic for each word, and then picking a word associated with that topic.
We go word by word until we have an entire document.
This would be a crazy way to write, but it seems to work ok-ish as a generative model for how topics guide the words we find in documents.
If you are inclined, you can go back and think about this more rigorously from the multinomial model of language framework.
Topics and words are both drawn from respective multinomial distributions.
You assume a Dirichlet prior distribution for the topic and word weights (a draw from a Dirichlet is a bunch of probabilities that sum to 1).
The details of this are out of scope for us but well understood by the experts.
The short story is that you can reverse engineer the model's parameters by inference from the observed distribution of words in your documents.
Those parameters constitute the probabilities of words belonging to topics and the proportions of topics in documents.
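To make the generative story concrete, here is a toy simulation sketch. The topics, vocabulary, and all of the numbers are made up for illustration, and gtools::rdirichlet() is just one convenient way to draw from a Dirichlet:

```r
# Toy simulation of the LDA generative story; every number here is made up.
set.seed(42)

vocab  <- c("election", "vote", "senate", "cell", "gene", "protein")
topics <- list(
  politics = c(0.40, 0.35, 0.20, 0.02, 0.02, 0.01),  # word weights, topic 1
  biology  = c(0.01, 0.02, 0.02, 0.35, 0.30, 0.30)   # word weights, topic 2
)

# Document-level topic proportions drawn from a Dirichlet prior
# (gtools::rdirichlet() is one of several places this function lives).
doc_topic_props <- gtools::rdirichlet(1, alpha = c(1, 1))[1, ]

# Build a 20-word "document": pick a topic for each word, then a word from it.
doc <- replicate(20, {
  z <- sample(1:2, size = 1, prob = doc_topic_props)  # topic for this word
  sample(vocab, size = 1, prob = topics[[z]])         # word from that topic
})
doc
```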
It does not tell you how many topics are in a corpus. You as the analyst have to make a choice about how many topics to model.
It also does not tell you what those topics are, substantively. You have to interpret that.
Topic models provide you with a measure of how associated a given topic is with each document (the “gamma” values).
Topic models give you a list of words and how associated they are with each topic (the “beta” values).
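If you fit the model with the topicmodels package, the tidytext package will hand you both of these as tidy data frames. A sketch, assuming a fitted model object lda_fit like the one above:

```r
# Tidy the fitted model into per-topic word probabilities ("beta") and
# per-document topic proportions ("gamma").
library(tidytext)
library(dplyr)

word_topics <- tidy(lda_fit, matrix = "beta")   # columns: topic, term, beta
doc_topics  <- tidy(lda_fit, matrix = "gamma")  # columns: document, topic, gamma

# A common first look: the top 10 words for each topic.
word_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  arrange(topic, desc(beta))
```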
One outstanding issue is that, like with clustering, it is incumbent upon the analyst to pick at the start how many topics the algorithm should try to organize the documents into. This is a high stakes decision, and there is really no guidance on how to do this best.
You just have to validate your results through close inspection.
Often, we know something about the documents - subject matter, date of publication, author characteristics, etc. Can we use that information in the model?
Indeed, there are ways to do this. Essentially, these methods let document-level covariates structure topic prevalence in documents.
This is called structural topic modeling and is implemented by the package stm in R.
The details of this are somewhat out of scope. If you think you have relevant document-level covariates, this might be worth investigating, but “unstructured” topic modeling is fine as you are getting to know this method in this class.
Can you run this on very short documents? Not really. You need documents with enough words that they can meaningfully inform the parameter estimates.
So applying this to individual tweets would not work so well.
You could aggregate up and look at all posts/writing from a user within a certain time period, for instance.
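That aggregation step is ordinary data wrangling. A quick sketch with dplyr, assuming a data frame tweets with user, date, and text columns (all names illustrative):

```r
# Pool short posts into one "document" per user per month.
library(dplyr)

user_docs <- tweets %>%
  mutate(month = format(as.Date(date), "%Y-%m")) %>%
  group_by(user, month) %>%
  summarise(text = paste(text, collapse = " "), .groups = "drop")
```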
(Look at the .qmd)
STM is a little more complex in terms of coding, because it relies on quanteda.
The vignette is good (https://cran.rstudio.com/web//packages/stm/vignettes/stmVignette.pdf) - the first 50% or so is easy to understand, then it gets a little harder.
The goal of this modeling endeavor is to do topic modeling on a number of political blog entries.
Useful metadata about these blogs includes the political ideology of the blogger (which is known) and the date of publication of the post. These could well have systematic effects on what topics are included in the blog post.
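The out$documents, out$vocab, and out$meta objects in the stm() call below come from stm's own preprocessing helpers. Roughly, following the vignette's workflow (the data frame and column names here are the vignette's, not something to assume for your own data):

```r
# Build the documents/vocab/meta objects stm() expects, following the
# vignette: `data` is a data frame with a `documents` text column plus
# `rating` and `day` metadata columns (from the vignette's poliblog data).
library(stm)

processed <- textProcessor(data$documents, metadata = data)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)
```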
```r
poliblogPrevFit <- stm(documents = out$documents, vocab = out$vocab,
                       K = 20, prevalence = ~ rating + s(day),
                       max.em.its = 75, data = out$meta,
                       init.type = "Spectral")
```
Prevalence is a function of rating plus day: this tells the model to incorporate these variables into the document-level topic prevalence parameter estimation (the gamma values).
Topic models are a helpful form of unsupervised learning that can help you get a sense of the content of documents.
It still requires a lot of validation.
Clustering and LDA are basically ways to group documents, and then you as the user can figure out what the clusters/topics are about.
But, clustering is in many dimensions, and LDA can include many topics.
What if you want a way to simplify your understanding of documents?
For instance, what if you wanted to place documents on something like a left-right political ideology scale?
Dimensionality reduction is a statistical technique for “compressing” data contained in many covariates, while retaining as much distinguishing information as possible.
You often read about this in the context of ML model building, where you might have hundreds or thousands of related covariates, and you might get better model performance if you compress those super wide data sets into a smaller number of summary scores.
This is covered in greater detail in the machine learning course.
Let’s “map” multidimensional data onto a lower-dimensional “component”.
This means finding, for each data point, the point on a line that is closest to it - projecting the point onto the line at a 90 degree angle.
PCA functions will find these lines for you and tell you the relative position of each point on this line, which becomes a “principal component score”
PCA also gives you principal component loadings, which in this context tell you how each word is associated with the latent variable (for instance, if the component ran from conservative to liberal, a word with a high loading would be a “liberal” word)
Clean your data (remove punctuation, stopwords, etc.).
Create a DTM.
Center and rescale! Very important and included in all tutorials.
Use an R/Python package to calculate PCs.
Assign a “scale” position to documents using a document’s PC score.
Interpret what the PC means? (next slide)
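A minimal sketch of these steps in R with base prcomp(), assuming a cleaned document-term matrix dtm (the object name is illustrative):

```r
# PCA on a document-term matrix: center/rescale, then pull out component
# scores (for documents) and loadings (for words).
dtm_mat <- as.matrix(dtm)

# Drop any term that never varies across documents; prcomp() cannot
# rescale a zero-variance column.
dtm_mat <- dtm_mat[, apply(dtm_mat, 2, sd) > 0]

pca_fit <- prcomp(dtm_mat, center = TRUE, scale. = TRUE)

doc_scores    <- pca_fit$x[, 1]         # each document's position on PC1
word_loadings <- pca_fit$rotation[, 1]  # how each word loads on PC1

# Words at the two extremes of PC1 help you interpret what it means.
head(sort(word_loadings), 10)                     # most negative loadings
head(sort(word_loadings, decreasing = TRUE), 10)  # most positive loadings
```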
You still have to interpret what the components mean, which puts a lot of onus on the researcher.
A common approach would be to look at PC loadings for certain words - those words with very positive/very negative loading would give you a sense of the scale meaning.
You could also sample documents with very positive/very negative PC scores.
The problem continues to be that you have to do all of this yourself. Validate!
You could get document embeddings from LLMs and then use PCA on those embeddings.
There are some political science-specific applications of scaling related to ideology (look up wordscores if you are interested, Benoit and co-authors).
There is emerging research about how to use LLMs for this task too.
What if you established some “ground truth” examples of a spectrum you believe is present in your documents, and you wanted to compare a set of new documents to your ground truth?
You might be able to prompt an LLM to do this, or you could do something like cosine similarity with the pre-trained embeddings from one of the big models like OpenAI.
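A sketch of the cosine similarity idea, assuming you have already retrieved embedding vectors (as plain numeric vectors) from whatever embedding model you use; nothing here is tied to a particular API:

```r
# Cosine similarity between two embedding vectors (assumed pre-computed).
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# e.g., compare a new document's embedding to "ground truth" anchors at
# either end of your spectrum (these objects are hypothetical):
# cosine_sim(doc_embedding, left_anchor_embedding)
# cosine_sim(doc_embedding, right_anchor_embedding)
```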
I’m not an expert on this, but I suspect this is going to be more of the wave of the future.
BERTopic by Grootendorst
(check the .qmd for many links hidden in the code block)
Special acknowledgment to this excellent introductory blog from Kevin Reuning, with a version of the bullets from the last slide.
But be warned, even Grootendorst will advise that this isn’t necessarily “better” than LDA. It just depends on your use case.
This is a Python-native package, but there is an R wrapper (included in the hidden links and on the resources page).
We’re done with unsupervised learning. Next up, sentiment analysis and using LLMs for NLP tasks.