In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents.
The intuition behind LDA is that documents exhibit multiple topics.
For example, consider the article in Figure 1. This article, entitled “Seeking Life’s Bare (Genetic) Necessities,” is about using data analysis to determine the number of genes an organism needs to survive (in an evolutionary sense). Words about data analysis, such as “computer” and “prediction,” are highlighted in blue; words about evolutionary biology, such as “life” and “organism,” are highlighted in pink; and words about genetics, such as “sequenced” and “genes,” are highlighted in yellow. The article blends genetics, data analysis, and evolutionary biology in different proportions.
LDA is a statistical model of document collections that tries to capture this intuition. It is most easily described by its generative process, the imaginary random process by which the model assumes the documents arose. (The interpretation of LDA as a probabilistic model is fleshed out later.)
Although topic models were first described and implemented in the context of natural language processing, they have applications in other fields such as bioinformatics.
We formally define a topic to be a distribution over a fixed vocabulary.
For example, the genetics topic has words about genetics with high probability and the evolutionary biology topic has words about evolutionary biology with high probability.
We assume that these topics are specified before any data has been generated.
# Toy word distributions for three topics, drawn as side-by-side pie charts.
# This uses base R graphics only; ggplot2 is not needed for pie() and par().
genetic.words <- c('gene', 'dna', 'genetic', 'others')
genetic.proportions <- c(40, 30, 20, 10)
par(fig=c(0, 0.3, 0, 1))  # left third of the plotting device
pie(genetic.proportions, labels=genetic.words, main="Genetics Topic", clockwise=TRUE, col=rainbow(length(genetic.words)))
evolution.words <- c('life', 'evolve', 'organism', 'others')
evolution.proportions <- c(50, 20, 25, 5)
par(fig=c(0.35, 0.65, 0, 1), new=TRUE)  # middle third; new=TRUE overlays the same device
pie(evolution.proportions, labels=evolution.words, main="Evolutionary\nBiology\nTopic", clockwise=TRUE, col=rainbow(length(evolution.words)))
computing.words <- c('data', 'number', 'computer', 'others')
computing.proportions <- c(35, 10, 50, 5)
par(fig=c(0.7, 1, 0, 1), new=TRUE)  # right third
pie(computing.proportions, labels=computing.words, main="Data Analysis\nTopic", clockwise=TRUE, col=rainbow(length(computing.words)))
Now for each document in the collection, we generate the words in a two-stage process:
1. Randomly choose a distribution over topics.
2. For each word in the document:
   (a) Randomly choose a topic from the distribution over topics in step #1.
   (b) Randomly choose a word from the corresponding distribution over the vocabulary.
This statistical model reflects the intuition that documents exhibit multiple topics.
Each document exhibits the topics in different proportion (step #1); each word in each document is drawn from one of the topics (step #2b), where the selected topic is chosen from the per-document distribution over topics (step #2a).
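To make the two-stage process concrete, here is a minimal simulation in base R. The six-word vocabulary, the three hand-made topics, and the Dirichlet parameter are all toy assumptions chosen for illustration, not values from the article.

# A minimal sketch of LDA's generative process for one document (base R).
set.seed(1)
vocab <- c('gene', 'dna', 'life', 'evolve', 'data', 'computer')
K <- 3  # number of topics
# Each row of beta is one topic: a distribution over the whole vocabulary.
beta <- rbind(
  genetics  = c(0.45, 0.35, 0.05, 0.05, 0.05, 0.05),
  evolution = c(0.05, 0.05, 0.45, 0.35, 0.05, 0.05),
  analysis  = c(0.05, 0.05, 0.05, 0.05, 0.45, 0.35)
)
# Draw one sample from a Dirichlet via normalized gamma variates.
rdirichlet1 <- function(alpha) { g <- rgamma(length(alpha), shape = alpha); g / sum(g) }
generate_document <- function(n_words, alpha = rep(0.5, K)) {
  theta <- rdirichlet1(alpha)              # step 1: per-document topic proportions
  replicate(n_words, {
    z <- sample(K, 1, prob = theta)        # step 2a: choose a topic
    sample(vocab, 1, prob = beta[z, ])     # step 2b: choose a word from that topic
  })
}
generate_document(10)

Documents generated from the same topics but different draws of theta exhibit the same topics in different proportions, which is exactly the intuition above.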
Example: In the example article, the distribution over topics would place probability on genetics, data analysis, and evolutionary biology, and each word is drawn from one of those three topics.
Notice that the next article in the collection might be about data analysis and neuroscience; its distribution over topics would place probability on those two topics.
This is the distinguishing characteristic of latent Dirichlet allocation: all the documents in the collection share the same set of topics, but each document exhibits those topics in different proportions.
LDA and other topic models are part of the larger field of probabilistic modeling.
In generative probabilistic modeling, we treat our data as arising from a generative process that includes hidden variables. This generative process defines a joint probability distribution over both the observed and hidden random variables.
We can describe LDA more formally with the following notation.
- The topics are \(\beta_{1:K}\), where each \(\beta_k\) is a distribution over the vocabulary (the distributions over words at left in Figure 1).
- The topic proportions for the \(d^{th}\) document are \(\theta_d\), where \(\theta_{d,k}\) is the topic proportion for topic \(k\) in document \(d\) (the cartoon histogram in Figure 1).
- The topic assignments for the \(d^{th}\) document are \(z_d\), where \(z_{d,n}\) is the topic assignment for the \(n^{th}\) word in document \(d\) (the colored coin in Figure 1).
- The observed words for document \(d\) are \(w_d\), where \(w_{d,n}\) is the \(n^{th}\) word in document \(d\), an element of the fixed vocabulary.
The key terms to understand are word, topic, and document.
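With this notation, the generative process corresponds to the joint distribution over the hidden and observed variables used in the standard LDA formulation:

\[
p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{k=1}^{K} p(\beta_k) \prod_{d=1}^{D} \Big( p(\theta_d) \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \Big).
\]

Note the dependencies this encodes: the topic assignment \(z_{d,n}\) depends on the per-document topic proportions \(\theta_d\), and the observed word \(w_{d,n}\) depends on both the assignment \(z_{d,n}\) and all of the topics \(\beta_{1:K}\).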
Step 1: Generate a word distribution for each topic.
Step 2: Generate a topic distribution for each document.
Step 3: For each word, choose a topic from the document's topic distribution, then choose a word from that topic's word distribution.
All we observe are the words of the documents; the topics, the per-document topic proportions, and the per-word topic assignments are hidden. The central computational problem is to infer these hidden variables from the observed words, and computing their posterior distribution exactly is intractable.
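Concretely, the posterior follows from the joint distribution above:

\[
p(\beta_{1:K}, \theta_{1:D}, z_{1:D} \mid w_{1:D}) = \frac{p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})}{p(w_{1:D})}.
\]

The denominator, the marginal probability of the observed words, requires summing over every possible topic structure, which is infeasible for realistic vocabularies and collections.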
The posterior is therefore approximated, most commonly with sampling-based methods such as Gibbs sampling or with variational methods, which recast inference as optimization.
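As a practical illustration, here is a minimal sketch of fitting LDA by Gibbs sampling with the R package topicmodels. The AssociatedPress document-term matrix shipped with the package stands in for a real corpus, and the choices of 50 documents, \(k = 3\) topics, and 500 iterations are arbitrary.

# Fit a small LDA model by Gibbs sampling (requires the 'topicmodels' package).
library(topicmodels)
data("AssociatedPress", package = "topicmodels")  # example document-term matrix
fit <- LDA(AssociatedPress[1:50, ], k = 3, method = "Gibbs",
           control = list(seed = 1, iter = 500))
terms(fit, 5)   # top 5 words per topic: the fitted topic distributions
topics(fit, 1)  # most probable topic per document: from the fitted proportions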