Statistical Learning: Week-15

Structural Topic Models (STM)

A gentle introduction to STM using international armament cooperation (IAC) data

# Load required packages and data
library(stm)

## Warning: package 'stm' was built under R version 4.5.3

## stm v1.3.8 successfully loaded. See ?stm for help. 
##  Papers, resources, and other materials at structuraltopicmodel.com

# Load our meta data and pre-processed LDA object
data_sub <- readRDS(url("https://www.dropbox.com/s/eh6ep50b7fkucwv/data_sub.rds?dl=1"))  # A subset of the original data, consisting of the text data and document-level meta data
iac_lda <- readRDS(url("https://www.dropbox.com/s/kazrof2ib2n4lz7/iac_lda.rds?dl=1")) # Preprocessed term-document matrix and the list of vocabulary

# Check out the class, dimension, and content of the two loaded objects
class(data_sub)

## [1] "data.frame"

dim(data_sub)

## [1] 287   5

names(data_sub)

## [1] "text"          "from"          "alliance"      "diff"         
## [5] "transfer_type"

# Look at the first three rows of the metadata
head(data_sub, 3)

##                                                                                                                                                                                                                                                                                                                               text
## 1 uk is the sole tier- partner defined by the level of financial commitment each partner made and the level of insights contributed to the development and iote phases in the f- jsf program the uk shared its unique vstol experiences in operating the harrier jets uk contractor bae participates in the production of fuselage
## 2                                                                       italy is one of the two tier- partner its program representatives were allowed to share insights in the design and developmnet phases italitan contractor alenia will assemble f-s designated for european users andparticipate in the production of wings
## 3                                                                                                                  netherlandsis one of the two tier- partner its program representatives were allowed to share insights in the design and developmnet phases dutch contractor stork participates in the production of panelsdoors
##   from  alliance diff transfer_type
## 1    1 0.6547643    0             3
## 2    1 0.6547643    3             3
## 3    1 0.6547643    1             3

class(iac_lda)

## [1] "list"

names(iac_lda)  # This dataframe consists of the original document-term matrix and a list of vocabulary

## [1] "documents" "vocab"

# Check out what a data entry in iac_lda$documents looks like
head(iac_lda$documents, 1)

## $text1
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
## [1,]    0    1    2    3    4    5    6    7    8     9    10
## [2,]    2    2    2    1    1    1    1    1    1     1     1

# "documents" (of a lda object) is a list whose length is equal to the number of documents in your data. Each element is an integer matrix with two rows. The column index represents a word occurring in the document: the first row is a 0-indexed word identifier for the j-th word in document i; the second row is an integer denoting the frequency with which the word appeared in the document.

# Check out what an entry in iac_lda$vocab looks like
head(iac_lda$vocab, 10)

##  [1] "-"            "partner"      "level"        "development"  "phases"      
##  [6] "program"      "unique"       "operating"    "contractor"   "participates"

# "vocab" (of an lda object) is a character vector containing the vocabulary (collection of words) associated with the word indices used in the documents.

## Pre-processing
# Merge and prepare the data in stm format
out <- prepDocuments(iac_lda$documents, iac_lda$vocab, data_sub) # The first argument is the indexed docs, the second argument is the list of vocabulary, the third argument is the original meta data
docs <- out$documents
vocab <- out$vocab
meta <- out$meta

# Transform the class of $from and $transfer_type variable to factor class
meta$from <- as.factor(meta$from)
meta$transfer_type <- as.factor(meta$transfer_type)

# Now explore the ideal number of topics (K) from the candidate list of 2 to 15 as a function of a list of metadata (variables)
# Note that we include an interaction term alliance*diff
# Use meta data in the data argument
topics <- searchK(docs, vocab, K = c(2, 15),  prevalence =~ alliance + diff + alliance*diff + from + transfer_type, data = meta, verbose = FALSE)

## Warning in stm(documents = heldout$documents, vocab = heldout$vocab, K = k, :
## K=2 is equivalent to a unidimensional scaling model which you may prefer.

plot(topics)

# Estimate a 10-topic STM using transfer_type (IAC outcome) as the only explanatory variable to sort military technology hierarchy into topic classification, set max iteration to 1000

iac.fit <- stm(documents = docs, vocab = vocab, K = 10, prevalence =~ alliance + diff + alliance*diff + from + transfer_type, max.em.its = 1000, data = meta, init.type = "Spectral", verbose = FALSE)

# Plot topic proportion and FREX terms

plot(iac.fit, type = "summary", xlim = c(-0.1, 1), labeltype = "frex", n = 5, topic.names = c('Gov contracts: ','Partnership: ','Missile programs: ', 'Sys. integration: ', 'Sensors: ', 'Missile defense: ', 'Maritime Sys.: ', 'Aegis radar: ', 'Armors: ', 'Aircraft engines: '))  # It's recommended that you first plot out the result, get a sense of how words are distributed across topics, infer what those topics are, and finally label them in the topic.names argument

# labeltype sets the metric used for evaluating the words displayed on the plot
# n is the number of FREX words displayed in each topic

# Topic content as a function of the direction of military technology transfer (from): content =~ from

iac.content <- stm(docs, vocab, K = 10, prevalence =~ alliance + diff + alliance* diff + from + transfer_type, content =~ from, max.em.its = 1000, data = meta, init.type = "Spectral", verbose = FALSE)


# Perspective plot is useful for mapping word distribution by issue position (usually some variables measured on the binary scale, pointing to the two extremes, for example, ('liberal', 'conservative'), ('oppose', 'support'), etc.)

plot(iac.content, type = "perspectives", n = 50, topics = 2, plabels = c('to U.S.', 'from U.S.'), main = "Partnership")   # plabels argument sets the labels displayed on the 'left' and the 'right' end of the horizontal axis

plot(iac.content, type = "perspectives", n = 50, topics = 6, plabels = c('to U.S.', 'from U.S.'), main = "Missile defense")

plot(iac.content, type = "perspectives", n = 50, topics = 8, plabels = c('to U.S.', 'from U.S.'), main = "Aegis radar")

plot(iac.content, type = "perspectives", n = 50, topics = 9, plabels = c('to U.S.', 'from U.S.'), main = "Armor")

plot(iac.content, type = "perspectives", n = 50, topics = 10, plabels = c('to U.S.', 'from U.S.'), main = "Aircraft engines")

# Estimate the relationship between metadata and topic content for topic 1 to 10, two main explanatory variables and indicator of military technology transfer are included as additional control variables

prep <- estimateEffect(1:10 ~ alliance + diff + alliance* diff + from + transfer_type, iac.fit, meta = meta, uncertainty = "Global")

summary(prep, topics=8)  # Using topic 8 (Aegis combat management system) as an example

## 
## Call:
## estimateEffect(formula = 1:10 ~ alliance + diff + alliance * 
##     diff + from + transfer_type, stmobj = iac.fit, metadata = meta, 
##     uncertainty = "Global")
## 
## 
## Topic 8:
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.06508    0.06655   0.978 0.328925    
## alliance       -0.02243    0.11149  -0.201 0.840725    
## diff            0.03486    0.02275   1.532 0.126696    
## from1           0.08817    0.02525   3.491 0.000558 ***
## transfer_type2 -0.08815    0.03385  -2.605 0.009691 ** 
## transfer_type3 -0.08380    0.03702  -2.263 0.024379 *  
## alliance:diff  -0.02981    0.03987  -0.748 0.455293    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Plot the net effects by the source of critical technology (U.S. versus non-U.S.)
# You need to use the preprocessed stm object in the first argument, the estimated stm output in the model argument. Also, note how I set the "cov.value1" and "cov.value2" argument, corresponding to whether the military technology under cooperation came from the U.S. (1) or foreign countries (0).
# You need to set labeltype = "custom" in order to create custom.labels

plot(prep, covariate = "from", topics = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),  model = iac.fit, 
     method = "difference", cov.value1 = 1, cov.value2 = 0, 
      main = "Sources of Critical Technology", xlab = "To US                                    From US",
     xlim = c(-.3, .2), labeltype = "custom", custom.labels = c('Gov. Contracts','Partnership','Missile Programs', 
     'Sys. Integration', 'Sensors', 'Missile Defense', 'Maritime Sys.', 'Aegis Radar', 'Armors', 'Aircraft Engines'))

# It turns out that only Aegis Radar and Armor technology exhibit significant difference in terms of their sources (U.S. versus non-U.S.). This finding makes a lot of sense as Aegis Radar is uniquely an U.S. innovation while European defense contractors hold a comparative advantage in armor technology compared to their American counterparts.

Statistical Learning: Week-15

H. K. Tseng

6/02/2026

Structural Topic Models (STM)