This is a very quick implementation of topic modeling using quanteda and STM. The corpus used is the UN General Debate Corpus.
A very good introduction to topic modeling with STM is provided in the STM vignette by Molly Roberts and colleagues. The code below closely follows the vignette.
We begin by loading the two packages (install from CRAN if you don’t already have them).
library("quanteda")
Package version: 1.3.0
Parallel computing: 2 of 4 threads used.
See https://quanteda.io for tutorials and examples.
Attache Paket: ‘quanteda’
The following object is masked from ‘package:utils’:
View
library("stm")
stm v1.3.3 (2018-1-26) successfully loaded. See ?stm for help.
Papers, resources, and other materials at structuraltopicmodel.com
Next we load the data. The corpus is identical with the version stored in the Harvard Dataverse, but with some additional metadata. The data can be downloaded here.
load("UNgeneraldebate.corpus.RData")
head(uncorpus.stats, 100)
Text <fctr> | Types <dbl> | Tokens <int> | Sentences <int> | country_code <chr> | session <int> | year <int> | ||
---|---|---|---|---|---|---|---|---|
1 | ALB_25_1970 | 1728 | 9078 | 257 | ALB | 25 | 1970 | |
2 | ARG_25_1970 | 1425 | 5192 | 218 | ARG | 25 | 1970 | |
3 | AUS_25_1970 | 1612 | 5690 | 270 | AUS | 25 | 1970 | |
4 | AUT_25_1970 | 1340 | 4717 | 164 | AUT | 25 | 1970 | |
5 | BEL_25_1970 | 1288 | 4786 | 207 | BEL | 25 | 1970 | |
6 | BLR_25_1970 | 1427 | 6138 | 204 | BLR | 25 | 1970 | |
7 | BOL_25_1970 | 1560 | 5613 | 226 | BOL | 25 | 1970 | |
8 | BRA_25_1970 | 1333 | 4427 | 154 | BRA | 25 | 1970 | |
9 | CAN_25_1970 | 728 | 1887 | 97 | CAN | 25 | 1970 | |
10 | CMR_25_1970 | 928 | 3144 | 106 | CMR | 25 | 1970 |
Now we calculate a document feature matrix (DFM), which is basically a table in which rows are texts and columns are words. We remove numbers, symbols, punctuation and standard English stop words and trim the DFM. Trimming in this case means both removing features which are rare (appearing in less that 7.5% of all documents) and ubiquitous (appearing in more than 90% of documents). Note that an untrimmed DFM will contain a lot of noise, slowing down processing without improving quality.
uncorpus.dfm <- dfm(uncorpus, remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE, remove = stopwords("english"))
uncorpus.dfm.trim <- dfm_trim(uncorpus.dfm, min_docfreq = 0.075, max_docfreq = 0.90, docfreq_type = "prop") # min 7.5% / max 95%
uncorpus.dfm.trim
Document-feature matrix of: 7,897 documents, 2,479 features (77.3% sparse).
We fit the STM model, here using a setting of k = 40 topics, and list the 10 terms with the highest topic probability for each topic.
Note: we have prepared the data in advance and simply load it here, but run the commented function call below to see STM do its iterative magic. Spectral initialization makes the results reproducible.
topic.count <- 40
dfm2stm <- convert(uncorpus.dfm.trim, to = "stm")
#model.stm <- stm(dfm2stm$documents, dfm2stm$vocab, K = topic.count, data = dfm2stm$meta, init.type = "Spectral") # this is the actual stm call
load("UNgeneraldebate.stm.RData")
data.frame(t(labelTopics(model.stm, n = 10)$prob))
X1 <fctr> | X2 <fctr> | X3 <fctr> | X4 <fctr> | X5 <fctr> | X6 <fctr> | X7 <fctr> | |
---|---|---|---|---|---|---|---|
council | lebanon | peoples | islands | african | nuclear | european | |
reform | arab | struggle | island | africa | weapons | europe | |
organization | palestinian | independence | government | republic | disarmament | cooperation | |
weapons | lebanese | national | regional | like | pakistan | union | |
member | region | viet | national | central | india | bosnia | |
nuclear | resolutions | nam | republic | solidarity | treaty | region | |
japan | organization | aggression | programme | congo | arms | republic | |
treaty | peoples | liberation | year | organization | south | process | |
important | role | forces | french | president | global | regional | |
role | rights | sovereignty | assistance | particular | indian | stability |
Let’s plot a few heuristics. Note that these are plot.STM custom plots included in the package. The plots show total topic share (a), topic constrast between two topics (b) and topic proportions within documents (c).
plot(model.stm, type = "summary", text.cex = 0.5)
plot(model.stm, type = "perspectives", topics = c(16,21)) # Topics #16 and #21
plot(model.stm, type = "hist", topics = sample(1:topic.count, size = 9))
Next we do an effect estimation of the topic prevalence over time.
model.stm.labels <- labelTopics(model.stm, 1:topic.count)
dfm2stm$meta$datum <- as.numeric(dfm2stm$meta$year)
model.stm.ee <- estimateEffect(1:topic.count ~ country + s(year), model.stm, meta = dfm2stm$meta)
Now we plot this estimation for a handful of topics (here 9 randomly chosen ones).
par(mfrow=c(3,3))
for (i in seq_along(sample(1:topic.count, size = 9)))
{
plot(model.stm.ee, "year", method = "continuous", topics = i, main = paste0(model.stm.labels$prob[i,1:3], collapse = ", "), printlegend = F)
}
See below for plotting all 40 topics and saving the result to hard drive.
# Plots of topic prevalence over time
#png(width = 800, height = 800)
#for (i in 1:topic.count)
#{
# plot(model.stm.ee, "year", method = "continuous", topics = i, main = paste0(model.stm.labels$prob[i,1:3], collapse = ", "), printlegend = F)
#}
#dev.off()