Topic models are unsupervised methods for automatically organizing, understanding, searching, and summarizing text documents (Blei 2012). They are built on term co-occurrence.
Topic models are mixed-membership models:
Each topic is a distribution over the topical terms.
Each document is a mixture of corpus-wide topics.
Each term is drawn from one of the topics.
In reality, we only observe the documents (not the topics). The other structure is underlying, or latent.
The goal is to infer the hidden variables (topics), given what we observe (documents).
In topic modeling, the full TDM or DTM is broken down into two major components:
The first component tells us the importance of the terms in topics, and using that importance information, the second component tells us the importance of topics in the documents.
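As a schematic of the dimensions only, the decomposition can be written as below, assuming \(D\) documents, \(V\) terms, and \(k\) topics. The symbols \(\tilde{W}\) (the row-normalized DTM), \(B\) (importance of terms in topics), and \(\Theta\) (importance of topics in documents) are labels introduced here for illustration, and the approximation is a sketch rather than an exact identity.

```latex
\[
\underbrace{\tilde{W}}_{D \times V} \;\approx\; \underbrace{\Theta}_{D \times k} \; \underbrace{B}_{k \times V}
\]
```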
The Latent Dirichlet Allocation (LDA) model is a generative probabilistic model for text introduced by Blei et al. (2002) and Blei et al. (2003). The LDA model extends the probabilistic Latent Semantic Indexing (pLSI) model proposed by Hofmann (1999). LDA is a mixed-membership model in which each document is a mixture of the topics in the model (Blei et al. 2003). LDA forms clusters of co-occurring terms, or topics, where each topic is a distribution over words.
LDA modeling assumptions include:
The text corpus is a set of \(D\) documents and there are \(V\) terms in the corpus. There are \(k\) topic distributions containing \(V\) terms and \(\beta_{i}\) is the multinomial for the \(i\)th topic.
The generative LDA process consists of a few steps. First, for each document, draw a topic distribution, \(\theta_{d} \sim Dir(\alpha)\), where \(Dir(\alpha)\) is a symmetric (uniform) Dirichlet distribution and \(\alpha\) is the scaling parameter. The Dirichlet distribution is a continuous multivariate generalization of the beta distribution, and its parameter vector \(\alpha\) controls the average shape and sparsity of the topic proportions. Then, for each word in the document, draw a topic \(z_{d,n} \sim Multi(\theta_{d})\), where \(Multi(\theta_{d})\) is a multinomial distribution, and draw a word \(w_{d,n} \sim Multi(\beta_{z_{d,n}})\). The LDA modeling approach estimates the posterior distribution of the topics given the documents.
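The generative steps above can be sketched as a small simulation in base R. This is a toy illustration, not how the topicmodels package fits a model: the vocabulary size, number of topics, document length, and Dirichlet parameters are all hypothetical, and rdirichlet1() is a helper defined here that draws from a Dirichlet distribution by normalizing independent Gamma draws.

```r
# Toy simulation of the LDA generative process (base R only)
set.seed(831)

# Helper (defined here): one Dirichlet draw via normalized Gamma draws
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}

V <- 6    # vocabulary size (hypothetical)
k <- 2    # number of topics (hypothetical)
N <- 20   # number of words in the simulated document (hypothetical)
vocab <- paste0("term", seq_len(V))

# beta: k x V matrix, each row is one topic's distribution over terms
beta <- t(replicate(k, rdirichlet1(rep(0.5, V))))

# Step 1: draw the document's topic proportions, theta_d ~ Dir(alpha)
theta_d <- rdirichlet1(rep(0.5, k))

# Step 2: for each word, draw a topic z, then a term from that topic's row of beta
z <- sample(seq_len(k), size = N, replace = TRUE, prob = theta_d)
w <- vapply(z, function(topic) sample(vocab, size = 1, prob = beta[topic, ]),
            character(1))

table(factor(z, levels = seq_len(k)))  # number of words generated by each topic
head(w)                                # first few simulated words
```

In practice only w is observed; fitting LDA works backwards from the words to estimates of theta_d and beta.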
The probabilistic generative model estimation is intractable, so one of the following approximate posterior inference algorithms is used in modeling:
The topicmodels package will be used to create the LDA model and the ldatuning package will be used to choose the \(k\) value. Other packages that can be used for LDA modeling and visualization include lda and LDAvis, respectively. LDAvis creates an interactive topic model visualization and is also available in Python. Note: LDAvis is not directly compatible with LDA models created using the topicmodels package.
library(topicmodels) # LDA, CTM
library(ldatuning) # Choosing k
Prior to topic model analysis, any empty documents that exist following preprocessing should be removed.
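A minimal sketch of this filtering step on a toy count matrix follows. The matrix and its values are illustrative only. The commented line at the end assumes the DTM is a tm DocumentTermMatrix, which is stored as a slam sparse matrix, so slam::row_sums() would play the role of rowSums().

```r
# Toy document-term count matrix; doc3 lost all of its terms in preprocessing
dtm_toy <- matrix(c(2, 0, 1,
                    0, 3, 0,
                    0, 0, 0,
                    1, 1, 0),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(paste0("doc", 1:4), c("dress", "fit", "size")))

# Keep only the documents that still contain at least one term
keep <- rowSums(dtm_toy) > 0
dtm_nonempty <- dtm_toy[keep, , drop = FALSE]
rownames(dtm_nonempty)

# For a tm DocumentTermMatrix (assumed here to be named DTM_red),
# the analogous filter would be:
# DTM_red <- DTM_red[slam::row_sums(DTM_red) > 0, ]
```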
We use a lemmatized DTM that is reduced using document frequency thresholding named DTM_red. The top 15% of terms based on the document frequency are retained.
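The code that produced DTM_red is not shown here, so the following is only a sketch of what document frequency thresholding might look like on a simulated count matrix. The matrix dimensions and the name dtm_red_toy are hypothetical.

```r
set.seed(831)

# Hypothetical toy count matrix: 50 documents, 40 terms
dtm_toy <- matrix(rpois(50 * 40, lambda = 0.3), nrow = 50,
                  dimnames = list(NULL, paste0("term", 1:40)))

# Document frequency: the number of documents in which each term appears
doc_freq <- colSums(dtm_toy > 0)

# Retain the top 15% of terms ranked by document frequency
n_keep <- round(0.15 * ncol(dtm_toy))
top_terms <- names(sort(doc_freq, decreasing = TRUE))[seq_len(n_keep)]
dtm_red_toy <- dtm_toy[, top_terms, drop = FALSE]

dim(dtm_red_toy)  # 50 documents by 6 retained terms
```

Documents left with no terms after this reduction would then be dropped before fitting the topic model.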
The number of topics, \(k\), can be chosen using many metrics. The FindTopicsNumber() function in the ldatuning package supports 4 measures; 3 of them (CaoJuan2009, Arun2010, and Deveaud2014) are evaluated below.
result <- FindTopicsNumber(
DTM_red,
topics = seq(from = 5, to = 25, by = 1),
metrics = c("CaoJuan2009", "Arun2010", "Deveaud2014"),
method = "VEM",
control = list(seed = 831),
mc.cores = 2L,
verbose = TRUE)
FindTopicsNumber_plot(result)
Based on one or more of the metrics, either the best \(k\) or a \(k\) that is a good trade-off between the metrics can be chosen. In the plot above, Across-Topic Divergence is maximized and Divergence is minimized at 5. Either 9 or 10 may be a good compromise of the 3 metrics.
The LDA() function in the topicmodels package can be used to create an LDA model with \(k\) topics from a DocumentTermMatrix object.
lda_mod <- LDA(DTM_red,
k = 5,
control = list(seed=831))
The distribution of terms in topics is \(\beta\). We can use a custom function beta_plot() to plot the top \(n\) terms per topic.
beta_plot(topic_object = lda_mod,
n = 6)
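The custom beta_plot() function is not shown here. As a rough sketch of the ranking that such a plot is presumably based on, the hypothetical helper below returns the top \(n\) terms per topic from a topics-by-terms probability matrix. The toy matrix values are illustrative only.

```r
# Hypothetical helper (not the beta_plot() used above): top n terms per topic
top_terms_per_topic <- function(beta_mat, n = 6) {
  # beta_mat: topics x terms matrix of term probabilities, with column names
  lapply(seq_len(nrow(beta_mat)), function(i) {
    head(sort(beta_mat[i, ], decreasing = TRUE), n)
  })
}

# Toy beta matrix with 2 topics and 4 terms (illustrative values only)
beta_toy <- rbind(c(0.50, 0.30, 0.15, 0.05),
                  c(0.10, 0.10, 0.20, 0.60))
colnames(beta_toy) <- c("dress", "fit", "fabric", "size")

top_terms_per_topic(beta_toy, n = 2)

# With a fitted topicmodels object, the matrix could be taken from
# posterior(lda_mod)$terms
```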
The distribution of topics in documents is \(\gamma\). The \(\gamma\) values are the estimated proportion of words from each document that are generated from that topic.
We can view the topic assignments for documents using the topics() function. This will display the most probable topic overall for each document, based on the terms in that document.
head(topics(lda_mod))
## 14099 7585 5974 2502 9802 19960
## 4 3 2 5 3 2
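Conceptually, the most probable topic for a document is the topic with the largest \(\gamma\) value in that document's row. A small sketch on a toy \(\gamma\) matrix (illustrative values, not taken from lda_mod):

```r
# Toy gamma matrix: 3 documents x 3 topics (illustrative values; rows sum to 1)
gamma_toy <- rbind(doc1 = c(0.70, 0.20, 0.10),
                   doc2 = c(0.25, 0.30, 0.45),
                   doc3 = c(0.10, 0.80, 0.10))

# Most probable topic per document: the column index of each row's maximum
most_likely <- apply(gamma_toy, 1, which.max)
most_likely

# For a fitted topicmodels object, the gamma matrix could be taken from
# posterior(lda_mod)$topics
```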
We can use a custom function, topic_group_plot(), to compare the topical distribution across the levels of a known factor variable.
topic_group_plot(lda_mod, cr, "Rating", plot_by="category")
## var
## Topics 1 2 3 4 5
## 1 42 81 127 210 425
## 2 36 78 151 219 485
## 3 27 45 93 189 565
## 4 33 61 101 212 507
## 5 23 43 77 146 532
We can use a custom function, doc_dist_plot(), to obtain the fitted distribution of topics over documents.
doc_dist_plot(lda_mod)
The LDA model has some notable shortcomings, as shown in the above plots. There tends to be very little difference in the topical distribution across documents, with each topic being represented approximately equally across the dataset. Additionally, topics tend to lack exclusivity, meaning that the top terms tend to be the same in more than one topic.
The Structural Topic Model (STM) combines 3 common topic models to create a semi-automated approach to modeling topics, which can also incorporate covariates and metadata in the analysis of text (Roberts et al. 2014). Additionally, unlike the LDA model, topics in STM can be correlated. This model is particularly salient in the topical analysis of open-ended textual data, such as survey data. In STM models that do not include covariates, the modeling approach is akin to the Correlated Topic Model (CTM), as proposed by Blei and Lafferty (2007).
STM is a mixture model, where each document can belong to a mixture of the designated \(k\) topics. Topic proportions, \(\theta_{d}\), can be correlated, and the topical prevalence can be impacted by covariates, \(X\), through a regression model, \(\theta_{d} \sim LogisticNormal(X_{d}\gamma,\Sigma)\), where \(\gamma\) holds the prevalence coefficients. This allows each document to have its own prior distribution over topics, rather than sharing a global mean. For each word, \(w_{d,n}\), the topic, \(z_{d,n}\), is drawn from a document-specific distribution, \(Multi(\theta_{d})\). Conditioned on the topic, a word is chosen from a multinomial distribution over words with parameters \(\beta_{z_{d,n}}\). The topical content covariate, \(U\), allows word use within a topic to vary by content.
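To make the role of the prevalence covariates concrete, the sketch below draws topic proportions for one document from a logistic normal distribution in base R. Everything numeric here is an illustrative assumption (the covariate values, the coefficient matrix gamma_coef, and the covariance Sigma). Treating the last topic as a reference category with its latent value fixed at 0 is also an assumption about the parameterization, not something stated in the text above.

```r
set.seed(831)

k <- 3            # number of topics (hypothetical)
x_d <- c(1, 4)    # intercept and one covariate value, e.g. a rating (hypothetical)

# 2 x (k - 1) matrix of prevalence coefficients (illustrative values)
gamma_coef <- matrix(c(0.2, -0.1,
                       0.5,  0.3), nrow = 2)

# Covariance of the latent vector; off-diagonals let topics be correlated
Sigma <- matrix(c(1.0, 0.4,
                  0.4, 1.0), nrow = 2)

# The mean of the latent vector depends on the document's covariates
mu_d <- as.vector(x_d %*% gamma_coef)

# Draw eta_d ~ N(mu_d, Sigma) using the Cholesky factor of Sigma
eta_d <- mu_d + as.vector(t(chol(Sigma)) %*% rnorm(k - 1))

# Map to the simplex (softmax), with the reference topic's value fixed at 0
theta_d <- exp(c(eta_d, 0)) / sum(exp(c(eta_d, 0)))
theta_d
```

Changing x_d shifts mu_d, which is how covariates move the expected topic proportions for a document.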
In choosing the STM model or assessing the goodness of fit, two measures can be used: semantic coherence and exclusivity (Roberts et al. 2014). A topic is coherent when high-probability terms for a topic occur together in documents. A topic is exclusive if the top words of the topic are not likely to also be top words in other topics.
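The co-occurrence idea behind coherence can be illustrated with a small base R function. It uses one common formulation: sum, over pairs of a topic's top words, the log of the co-document frequency plus one, divided by the document frequency of the higher-ranked word. Whether this matches the stm implementation term for term is an assumption; the function name and toy matrix are hypothetical.

```r
# Hypothetical helper: coherence of one topic's top words.
# Assumes every top word appears in at least one document.
semantic_coherence <- function(dtm, top_words) {
  # 1 if the word occurs in the document, 0 otherwise
  present <- (dtm[, top_words, drop = FALSE] > 0) * 1
  co_doc <- crossprod(present)   # co-document frequencies for each word pair
  doc_freq <- diag(co_doc)       # document frequency of each word
  total <- 0
  for (i in 2:length(top_words)) {
    for (j in 1:(i - 1)) {
      total <- total + log((co_doc[i, j] + 1) / doc_freq[j])
    }
  }
  total
}

# Toy document-term matrix: "a" and "b" usually co-occur, "a" and "c" rarely do
dtm_toy <- matrix(c(1, 1, 0,
                    1, 1, 0,
                    1, 0, 1,
                    0, 0, 1,
                    1, 1, 0),
                  nrow = 5, byrow = TRUE,
                  dimnames = list(NULL, c("a", "b", "c")))

semantic_coherence(dtm_toy, c("a", "b"))  # higher: the words share documents
semantic_coherence(dtm_toy, c("a", "c"))  # lower: the words rarely co-occur
```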
The stm package is used to create structural topic models.
library(stm) # STM
STM will remove infrequent terms automatically, so we can use the unreduced, lemmatized DTM. Since we will use covariates, we will save the variables as metad_vars. The readCorpus() function will preliminarily format a DocumentTermMatrix object for use in the stm package.
corp <- readCorpus(cr_DTM_lem, type="slam")
metad_vars <- cr[,c("Age", "Rating", "Recommended.IND", "Positive.Feedback.Count", "Division.Name", "Department.Name", "Class.Name")]
We use the prepDocuments() function to identify the documents, terms, and metadata and to remove infrequent terms based on a lower threshold of 50 documents. Any documents left without terms are dropped as well.
out <- prepDocuments(documents = corp$documents,
vocab = corp$vocab,
meta = metad_vars,
lower.thresh = 50)
## Removing 4895 of 5305 terms (26074 of 118956 tokens) due to frequency
## Your corpus now has 4508 documents, 410 terms and 92882 tokens.
A number of diagnostic measures can be obtained to help inform the choice of \(k\), the number of topics, using the searchK() function in the stm package. We can seek to maximize held-out likelihood and semantic coherence and minimize residual dispersion.
set.seed(831)
stm.search <- searchK(documents = out$documents,
vocab = out$vocab,
K = 10:30,
init.type = "Spectral",
prevalence = ~ Rating + Age + Positive.Feedback.Count + Recommended.IND,
data = out$meta)
plot(stm.search)
Two measures that offer a useful tradeoff when choosing a \(k\) value are Semantic Coherence and Exclusivity. Semantic Coherence is maximized when the most probable terms in a topic co-occur frequently (which will happen naturally when \(k\) is low and top words are common). To balance this, we can consider Exclusivity, which is high when top terms are unique to a particular topic. We can look for a compromise point that balances the two measures by plotting their average values for each \(k\) value considered in the search. Based on the plot of these two measures, a choice of 15 or 16 may provide a good tradeoff.
Using \(k = 15\), we can build an STM model using Spectral initialization, which applies a spectral decomposition (non-negative matrix factorization) of the word co-occurrence matrix and leads to consistent results across runs.
Prevalence covariates are chosen as those variables that may impact the frequency (or prevalence) of a topic in the document collection. Numeric or factor variables can be included as prevalence covariates. Prevalence covariates in the model include: Rating, Age, Positive.Feedback.Count, Recommended.IND. If a time/date variable is included in the dataset, including it as a ‘time since origin’ prevalence covariate can allow you to map (expected) topical prevalence over time.
Content covariates are chosen as those variables that may impact how a topic is discussed in the document collection, that is, the words used within the topic. Rating is included in the model as a content covariate. Note: the content formula takes a single variable, and a content covariate with many levels can lead to slow convergence.
Interaction terms for covariates can be included.
stm_mod <- stm(documents = out$documents,
vocab = out$vocab,
K = 15,
init.type = "Spectral",
prevalence = ~ Rating + Age + Positive.Feedback.Count + Recommended.IND,
content = ~ Rating,
data = out$meta,
seed = 831)
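Interaction terms use standard R formula syntax. As a small sketch, the hypothetical prevalence formula below adds a Rating-by-Age interaction, and terms() shows which terms the formula expands to. The formula is only an example and is not the one used in stm_mod.

```r
# Hypothetical prevalence formula with a Rating-by-Age interaction
prevalence_int <- ~ Rating * Age + Recommended.IND

# Expand the formula to list the main effects and the interaction term
attr(terms(prevalence_int), "term.labels")
```

A formula like this could be passed to the prevalence argument in place of the additive formula.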
The plot() function can be used on an stm object to provide the topical frequency and top n (default = 3) words (type = "summary", default).
plot(stm_mod,
n = 5,
text.cex = .8)
We can use plot() with type = "perspectives" to compare two topics, or a single topic across two covariate levels, to see how the terms differ. We use set.seed() to make the output reproducible. Comparing the content in Topics 3 and 12:
set.seed(831)
plot(stm_mod,
type="perspectives",
topics=c(3,12),
plabels = c("Topic 3","Topic 12"))
Comparing the content (terms based on probability) in Topic 9 across two Rating levels, 1 and 5:
set.seed(832)
plot(stm_mod,
type="perspectives",
topics=9,
covarlevels = c(1,5),
plabels=c("Rating = 1", "Rating = 5"),
main = "Topic 9")
Word clouds can be created using the cloud() function for the global model (type = "model", the default) or for specific documents (type = "documents", with the documents identified in the documents argument). The default for the max.words argument is 100.
set.seed(831)
cloud(stm_mod, topic = 1)
The estimateEffect() function can be used to estimate a regression for each topic listed on the left-hand side of the formula argument, using the documents as observations, the covariates as independent variables, and the documents' topical proportions as the dependent variable. Confidence intervals are estimated by default.
set.seed(831)
stm_ee <- estimateEffect(1:15~Rating + Age + Recommended.IND,
stmobj = stm_mod,
metadata = out$meta,
uncertainty="Global")
We can use the summary() function to view the full output, along with the statistical significance flags. We will use indexing to view the summary information for Topic 15 (which removes the significance flags).
summary(stm_ee)[["tables"]][[15]]
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.117683602 1.013351e-02 11.6133129 9.637647e-31
## Rating2 -0.008112754 1.060374e-02 -0.7650843 4.442615e-01
## Rating3 -0.012280054 1.065081e-02 -1.1529688 2.489844e-01
## Rating4 -0.042307498 1.236368e-02 -3.4219190 6.273478e-04
## Rating5 -0.050604880 1.224795e-02 -4.1317011 3.666291e-05
## Age 0.000141237 9.403535e-05 1.5019560 1.331786e-01
## Recommended.IND1 -0.021289230 7.278934e-03 -2.9247729 3.464291e-03
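As a worked reading of the table above, the coefficients can be combined to give the expected proportion of Topic 15 for a particular kind of review. The sketch below copies the point estimates from the table and uses a hypothetical review with Rating = 5, Age = 40, and Recommended.IND = 1. It ignores the uncertainty in the estimates.

```r
# Point estimates copied from the Topic 15 summary table above
coefs <- c("(Intercept)"      =  0.117683602,
           "Rating5"          = -0.050604880,
           "Age"              =  0.000141237,
           "Recommended.IND1" = -0.021289230)

# Hypothetical review: Rating = 5, Age = 40, Recommended.IND = 1
x <- c("(Intercept)" = 1, "Rating5" = 1, "Age" = 40, "Recommended.IND1" = 1)

# Linear predictor: the sum of each coefficient times its covariate value
expected_prop <- sum(coefs * x[names(coefs)])
round(expected_prop, 4)  # 0.0514
```

Compared with the intercept of about 0.118 (a Rating 1, not recommended review at Age 0), the expected share of Topic 15 is lower for highly rated, recommended items.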
We can use plot() on an estimateEffect object to get point estimate information (mean topic proportions for each value of the covariate), compare differences (mean difference in topic proportions across two covariate levels) and plot expected topic proportions across a continuous variable. Moderator variables can be included if interactions are incorporated.
First, we can use the labelTopics() function to view the top terms for a topic overall, for the covariate, and for the topic-covariate interaction. If no content covariate is provided, the labelTopics() output includes the top terms based on probability, FREX (frequency and exclusivity), lift and score.
labelTopics(stm_mod,
topics = c(7))
## Topic Words:
## Topic 7: dress, girl, slip, zipper, belt, area, curvy
##
## Covariate Words:
## Group 1: bad, completely, due, put, disappoint, something, ever
## Group 2: sadly, bad, close, return, disappoint, seam, fine
## Group 3: plus, sadly, however, okay, seem, completely, bad
## Group 4: rather, reason, still, other, keep, issue, agree
## Group 5: please, happy, must, purchase, slightly, prefer, texture
##
## Topic-Covariate Interactions:
## Topic 7, Group 1: ivory, exactly, flowy, embroidery, beautiful, gorgeous, fabric
## Topic 7, Group 2: kind, return, navy, want, picture, pink, green
## Topic 7, Group 3: buy, use, yet, day, cold, quality
## Topic 7, Group 4: lovely, high, leg, waisted, hip, regular, problem
## Topic 7, Group 5: vest, gray, pocket, true, summer, perfect, light
##
We can view the expected topic proportions and CI for the Age variable for the model predicting Topic 7.
plot(x = stm_ee,
covariate = "Age",
topic = c(7),
model = stm_mod,
method = "continuous")
We can use method = "pointestimate" to plot the mean topic proportions and CI for the levels of the Rating variable (1-5).
labelTopics(stm_mod,
topics = c(3))
## Topic Words:
## Topic 3: spin-dry, price, wash, quality, worth, hold, sale
##
## Covariate Words:
## Group 1: bad, completely, due, put, disappoint, something, ever
## Group 2: sadly, bad, close, return, disappoint, seam, fine
## Group 3: plus, sadly, however, okay, seem, completely, bad
## Group 4: rather, reason, still, other, keep, issue, agree
## Group 5: please, happy, must, purchase, slightly, prefer, texture
##
## Topic-Covariate Interactions:
## Topic 3, Group 1: already, shoulder, jacket, really
## Topic 3, Group 2: wash, first, exactly
## Topic 3, Group 3: first
## Topic 3, Group 4: white
## Topic 3, Group 5: thin, expect
##
plot(x = stm_ee,
covariate = "Rating",
topic = c(3),
model = stm_mod,
method = "pointestimate",
labeltype = "custom",
custom.labels = c(1:5),
main = "Topic 3 by Rating")