This is a very quick implementation of topic modeling using [quanteda](https://quanteda.io/) and [STM](https://www.structuraltopicmodel.com/). The corpus used is the [UN General Debate Corpus](http://www.smikhaylov.net/ungdc/).

A very good introduction to topic modeling with STM is provided in [the STM vignette by Molly Roberts and colleagues](https://github.com/bstewart/stm/blob/master/inst/doc/stmVignette.pdf?raw=true). The code below closely follows the vignette.

We begin by loading the two packages (install from CRAN if you don’t already have them).

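If they are missing, a one-time install from CRAN is enough:

```{r}
# One-time setup: install both packages from CRAN (commented out so the
# notebook does not reinstall on every run)
# install.packages(c("quanteda", "stm"))
```
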
library("quanteda")
Package version: 1.3.0
Parallel computing: 2 of 4 threads used.
See https://quanteda.io for tutorials and examples.

Attache Paket: ‘quanteda’

The following object is masked from ‘package:utils’:

    View
library("stm")
stm v1.3.3 (2018-1-26) successfully loaded. See ?stm for help. 
 Papers, resources, and other materials at structuraltopicmodel.com

Next we load the data. The corpus is identical to [the version stored in the Harvard Dataverse](https://doi.org/10.7910/DVN/0TJX8Y), but with some additional metadata. **The data can be downloaded [here](https://www.dropbox.com/s/7lfca0nfupzwlyl/UNgeneraldebate.zip?dl=1)**.

load("UNgeneraldebate.corpus.RData")
head(uncorpus.stats, 100)
| | Text | Types | Tokens | Sentences | country_code | session | year |
|---:|------|------:|-------:|----------:|--------------|--------:|-----:|
| 1 | ALB_25_1970 | 1728 | 9078 | 257 | ALB | 25 | 1970 |
| 2 | ARG_25_1970 | 1425 | 5192 | 218 | ARG | 25 | 1970 |
| 3 | AUS_25_1970 | 1612 | 5690 | 270 | AUS | 25 | 1970 |
| 4 | AUT_25_1970 | 1340 | 4717 | 164 | AUT | 25 | 1970 |
| 5 | BEL_25_1970 | 1288 | 4786 | 207 | BEL | 25 | 1970 |
| 6 | BLR_25_1970 | 1427 | 6138 | 204 | BLR | 25 | 1970 |
| 7 | BOL_25_1970 | 1560 | 5613 | 226 | BOL | 25 | 1970 |
| 8 | BRA_25_1970 | 1333 | 4427 | 154 | BRA | 25 | 1970 |
| 9 | CAN_25_1970 | 728 | 1887 | 97 | CAN | 25 | 1970 |
| 10 | CMR_25_1970 | 928 | 3144 | 106 | CMR | 25 | 1970 |

(first 10 of 100 rows shown)

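The `uncorpus.stats` table ships inside the .RData file. If you build your own corpus, quanteda can compute the same statistics directly; a minimal sketch, assuming `uncorpus` is a quanteda corpus object (which the `dfm()` call below requires anyway):

```{r}
# summary() on a quanteda corpus returns Text, Types, Tokens and Sentences
# plus all document variables, i.e. the columns shown above
uncorpus.stats.check <- summary(uncorpus, n = ndoc(uncorpus))
head(uncorpus.stats.check, 10)
```
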
Now we create a document-feature matrix (DFM), which is basically a table in which rows are texts and columns are words. We remove numbers, symbols, punctuation and standard English stop words, and then trim the DFM. Trimming here means removing both features that are rare (appearing in less than 7.5% of all documents) and features that are ubiquitous (appearing in more than 90% of documents). Note that an untrimmed DFM will contain a lot of noise, slowing down processing without improving quality.

```{r}
uncorpus.dfm <- dfm(uncorpus, remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE, remove = stopwords("english"))
uncorpus.dfm.trim <- dfm_trim(uncorpus.dfm, min_docfreq = 0.075, max_docfreq = 0.90, docfreq_type = "prop") # keep features occurring in 7.5% to 90% of documents
uncorpus.dfm.trim
```
Document-feature matrix of: 7,897 documents, 2,479 features (77.3% sparse).
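
To get a quick feel for what survived the trimming, quanteda's `topfeatures()` lists the most frequent remaining features (a sanity check added here, beyond the steps the vignette covers):

```{r}
# The 20 most frequent features left after trimming
topfeatures(uncorpus.dfm.trim, 20)
```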

We fit the STM model, here using K = 40 topics, and list the 10 terms with the highest topic probability for each topic.

Note: we have fitted the model in advance and simply load it here, but run the commented function call below to see STM do its iterative magic. Spectral initialization makes the results reproducible.

```{r}
topic.count <- 40
dfm2stm <- convert(uncorpus.dfm.trim, to = "stm")
# model.stm <- stm(dfm2stm$documents, dfm2stm$vocab, K = topic.count, data = dfm2stm$meta, init.type = "Spectral") # this is the actual stm call
load("UNgeneraldebate.stm.RData")
data.frame(t(labelTopics(model.stm, n = 10)$prob))
```
| X1 | X2 | X3 | X4 | X5 | X6 | X7 |
|----|----|----|----|----|----|----|
| council | lebanon | peoples | islands | african | nuclear | european |
| reform | arab | struggle | island | africa | weapons | europe |
| organization | palestinian | independence | government | republic | disarmament | cooperation |
| weapons | lebanese | national | regional | like | pakistan | union |
| member | region | viet | national | central | india | bosnia |
| nuclear | resolutions | nam | republic | solidarity | treaty | region |
| japan | organization | aggression | programme | congo | arms | republic |
| treaty | peoples | liberation | year | organization | south | process |
| important | role | forces | french | president | global | regional |
| role | rights | sovereignty | assistance | particular | indian | stability |

(first 7 of the 40 topic columns shown)

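The choice of K = 40 is a judgment call rather than something the data dictates. For data-driven guidance, stm's `searchK()` compares diagnostics such as held-out likelihood and semantic coherence across candidate values of K; a hedged sketch, commented out because it refits the model several times and is slow on a corpus of this size:

```{r}
# Compare model diagnostics across several candidate topic counts
# k.search <- searchK(dfm2stm$documents, dfm2stm$vocab, K = c(20, 40, 60),
#                     init.type = "Spectral")
# plot(k.search) # held-out likelihood, residuals, semantic coherence, lower bound
```
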
Let’s plot a few heuristics. Note that these are [plot.STM](https://www.rdocumentation.org/packages/stm/versions/1.3.3/topics/plot.STM) custom plots included in the package. The plots show total topic share (a), topic contrast between two topics (b) and topic proportions within documents (c).

plot(model.stm, type = "summary", text.cex = 0.5)

plot(model.stm, type = "perspectives", topics = c(16,21)) # Topics #16 and #21

plot(model.stm, type = "hist", topics = sample(1:topic.count, size = 9))

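Highest-probability words can be dominated by terms that are frequent across many topics. `labelTopics()` also returns FREX words, which weight a word's frequency by its exclusivity to the topic; as an extra check beyond the steps above, here for the two topics contrasted in the perspectives plot:

```{r}
# FREX terms often yield more distinctive topic labels than raw probability
labelTopics(model.stm, topics = c(16, 21), n = 10)$frex
```
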
Next we estimate the effect of the document metadata on topic prevalence, modeling prevalence as a function of country and a smoothed year term.

```{r}
model.stm.labels <- labelTopics(model.stm, 1:topic.count)
dfm2stm$meta$datum <- as.numeric(dfm2stm$meta$year) # numeric copy of the year variable
model.stm.ee <- estimateEffect(1:topic.count ~ country + s(year), model.stm, meta = dfm2stm$meta)
```

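If you prefer regression tables to plots, the fitted effect object also has a summary method; topic 16 below is an arbitrary example pick:

```{r}
# Regression table (with simulation-based uncertainty) for a single topic
summary(model.stm.ee, topics = 16)
```
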
Now we plot this estimation for a handful of topics (here 9 randomly chosen ones).

```{r}
par(mfrow = c(3, 3))
for (i in sample(1:topic.count, size = 9)) # iterate over the 9 sampled topic numbers
{
  plot(model.stm.ee, "year", method = "continuous", topics = i, main = paste0(model.stm.labels$prob[i, 1:3], collapse = ", "), printlegend = FALSE)
}
```

See below for plotting all 40 topics and saving the results to disk.

```{r}
# Plots of topic prevalence over time; without a filename argument,
# png() writes one file per plot to Rplot%03d.png in the working directory
# png(width = 800, height = 800)
# for (i in 1:topic.count)
# {
#   plot(model.stm.ee, "year", method = "continuous", topics = i, main = paste0(model.stm.labels$prob[i, 1:3], collapse = ", "), printlegend = FALSE)
# }
# dev.off()
```