In this article we will look at how to fit a Latent Topic model and interpret the topics. For illustration, I will demonstrate topic analysis based on the SEC 10-K filings of 20 randomly chosen Technology-sector firms from the Fortune 1000.
The first step in topic mining is to process the text data and create a Document Term Matrix. For your homework, we have processed the text data and provide the Document Term Matrix and the text data as R objects (.Rds), so that you can load them directly into your R workspace. For fitting and visualising a Latent Topic Model, we will need the following packages.
If you have not installed these packages, please install them using the command install.packages("package-name").
Clear the workspace and invoke required libraries
rm(list=ls()) # Clear the workspace
library("tm")
## Warning: package 'tm' was built under R version 3.2.3
## Warning: package 'NLP' was built under R version 3.2.3
library("wordcloud")
## Warning: package 'wordcloud' was built under R version 3.2.4
## Warning: package 'RColorBrewer' was built under R version 3.2.3
library("maptpx")
## Warning: package 'maptpx' was built under R version 3.2.4
## Warning: package 'slam' was built under R version 3.2.3
library("igraph")
## Warning: package 'igraph' was built under R version 3.2.4
textdata = readRDS("BD.Technology.rds")
dtm1 = readRDS("dtm1.BD.rds")
Next we (i.e., the user) have to input the number of latent topics we think there are in the corpus.
[Of course, a priori, we wouldn't know how many topics are really there. So we do trial-and-error: we choose K = 2, then 3, then 4, and so on till the results look reasonable and interpretable.]
After trying 2, 3, and 4 topics, we decided on K = 3. So ours is a 3-topic latent model.
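As an optional aside, the maptpx package can do part of this search for us: according to its documentation, topics() accepts a vector of K values, fits each one, and returns the model for the K with the highest Bayes factor against a single-topic null model. The sketch below is commented out since it refits the model several times; you can uncomment it if you want to try it. For the rest of this article we simply proceed with K = 3.
# Optional sketch: Bayes-factor search over candidate K values (refits the model for each K)
# ksel = topics(dtm1, K = 2:4, verb = 1)
# ksel$K  # the number of topics selected via the Bayes factor
# ksel$BF # log Bayes factors for the candidate K values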
K = 3 # Choose number of topics in the model
simfit = topics(dtm1, K = K, verb = 2) # Fit the K topic model
##
## Estimating on a 85 document collection.
## Fitting the 3 topic model.
## log posterior increase: 4189.1, 1706.6, done. (L = -2397182.5)
summary(simfit, nwrd = 12) # Summary of simfit model
##
## Top 12 phrases by topic-over-null term lift (and usage %):
##
## [1] 'adhesion', 'advantedge', 'appliedmaterials', 'baccini', 'centura', 'chipmaking', 'dpn', 'duv', 'endura', 'epitaxy', 'equipment-services', 'eterna' (69)
## [2] 'teradata-division', 'aster', 'mapreduce', 'joining-teradata', 'mcdonald', 'langos', 'nyquist', 'platform-family', 'wimmer', 'blanton', 'koehler', 'scheppmann' (16.5)
## [3] 'lien', 'prepares', 'protests', 'reynolds', 'submission', 'ira', 'trail', 'backgrounds', 'compile', 'custodian', 'custodians', 'depository' (14.5)
##
## Dispersion = 2.64
Suppose there are D documents and T term-tokens in the corpus, and we are fitting K topics. The topic model then gives us two main outputs:
One, a \(\theta\) matrix of term probabilities, which tells us, for each term, the probability that the term belongs to each topic. Its dimension is T x K.
Two, an \(\omega\) document-composition matrix, which gives the probability mass distribution of topic proportions within each document. Its dimension is D x K.
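As a quick sanity check, we can confirm these dimensions directly on the fitted object (a minimal sketch):
dim(simfit$theta) # T x K: number of terms by number of topics
dim(simfit$omega) # D x K: number of documents by number of topics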
E.g., let’s view the term probability matrix \(\theta\):
simfit$theta[1:10,]
## topic
## phrase 1 2 3
## aberle 1.048846e-08 6.208698e-05 5.776778e-10
## abi-research 1.730882e-06 1.170353e-04 5.782557e-10
## abide 2.261884e-06 9.410876e-05 5.797659e-10
## abilities 1.456643e-07 8.067324e-05 2.574736e-05
## ability 1.885960e-03 2.030214e-07 1.385226e-03
## abroad 6.137769e-05 1.149882e-06 8.555536e-05
## absence 1.447981e-06 6.236133e-05 4.063358e-05
## absolute 2.796881e-07 1.106695e-04 3.830797e-05
## absorption 3.865326e-08 1.033893e-04 5.778734e-10
## abundant 3.727291e-08 4.126525e-05 5.778194e-10
Let's sort this matrix in decreasing order of total term probability and check the top few terms.
a0 = apply(simfit$theta, 1, sum);
a01 = order(a0, decreasing = TRUE)
simfit$theta[a01[1:10],]
## topic
## phrase 1 2 3
## services 6.822479e-03 2.624353e-06 1.160616e-02
## solutions 3.084437e-03 8.406747e-05 9.128283e-03
## products 1.091979e-02 3.111718e-07 3.106755e-04
## business 4.307310e-03 7.076063e-06 5.805670e-03
## customers 9.436238e-03 3.148863e-07 3.064247e-08
## company 4.016365e-03 6.067617e-07 4.320089e-03
## clients 6.240051e-05 4.951274e-09 8.055256e-03
## provide 4.369962e-03 6.308906e-07 3.116367e-03
## information 2.560277e-03 2.780112e-06 4.616670e-03
## content 6.315863e-04 2.635999e-05 6.259498e-03
Here you can see that general business terms such as 'services', 'products', and 'business' carry most of their probability mass in topics 1 and 3, with only tiny probabilities in topic 2. Similarly, we can view the \(\omega\) matrix for documents.
simfit$omega[1:10,]
## topic
## document 1 2 3
## 1 0.99981305 6.693992e-05 0.0001200120
## 2 0.99963670 1.353498e-04 0.0002279463
## 3 0.99983087 6.838078e-05 0.0001007477
## 4 0.04862695 1.473530e-04 0.9512256930
## 5 0.99977716 7.502209e-05 0.0001478197
## 6 0.88745625 1.115747e-01 0.0009690774
## 7 0.64201252 1.828276e-01 0.1751598777
## 8 0.68601405 3.137507e-01 0.0002352900
## 9 0.70572260 9.754706e-02 0.1967303405
## 10 0.66359037 1.090356e-01 0.2273740408
We can say that documents 1, 2, and 3 load almost entirely on topic 1, document 4 loads heavily on topic 3, and documents such as 7 through 10 are mixtures of the topics.
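As a quick summary of the whole \(\omega\) matrix, we can also tabulate which topic dominates each document (a minimal sketch using the simfit object fitted above):
dominant = apply(simfit$omega, 1, which.max) # index of the highest-proportion topic per document
table(dominant) # number of documents dominated by each topic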
Some terms have high frequency, others low. We want to ensure that raw term frequency does not unduly influence topic weights, so we normalize it into a metric called 'lift'.
The lift of a term for a topic is its topic-membership probability normalized by its occurrence probability in the corpus: \(\text{lift}(t,k) = \theta_{t,k} / p(t)\), where \(p(t)\) is the term's share of all token occurrences. If a term's lift for a topic is high, then that term is useful in characterising that topic.
Since the topics() function doesn't return a lift matrix for terms, we can calculate the lift of each term ourselves.
Depending on the number of terms in the DocumentTermMatrix, the lift calculation may take some time; in your case it should complete in 2-3 minutes.
t = Sys.time()
theta = simfit$theta
lift = theta*0; sum1 = sum(dtm1) # sum1 = total number of token occurrences in the corpus
for (i in 1:nrow(theta)){
  for (j in 1:ncol(theta)){
    ptermtopic = theta[i, j] # term i's probability of topic j membership
    pterm = sum(dtm1[,i])/sum1 # marginal probability of term i's occurrence in the corpus
    lift[i, j] = ptermtopic/pterm # lift = topic membership probability normalized by occurrence probability
  }
}
Sys.time() - t # Total time for calculating lift
## Time difference of 4.755205 mins
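As an aside, the same lift matrix can be computed in one vectorized step, avoiding the double loop. This is only a sketch; it assumes the rows of theta and the columns of dtm1 refer to the same terms (we align them by name) and uses col_sums() from the slam package, which is loaded alongside tm.
pterm = slam::col_sums(dtm1)[rownames(theta)]/sum(dtm1) # marginal probability of each term
lift2 = theta/pterm # divides each row of theta by that term's marginal probability
max(abs(lift2 - lift)) # should be (near) zero if it matches the loop above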
Let's print the lift for the first 10 terms. [Remember, lift has the same dimensions as the \(\theta\) matrix.]
lift[1:10,]
## topic
## phrase 1 2 3
## aberle 0.001019517 6.0350821306 5.615240e-05
## abi-research 0.084124021 5.6881283121 2.810429e-05
## abide 0.131918071 5.4886300011 3.381322e-05
## abilities 0.008495463 4.7050410163 1.501643e+00
## ability 1.255631905 0.0001351673 9.222535e-01
## abroad 1.118650555 0.0209573890 1.559305e+00
## absence 0.084449428 3.6370497365 2.369840e+00
## absolute 0.011651448 4.6103491280 1.595861e+00
## absorption 0.002254343 6.0298887416 3.370285e-05
## abundant 0.005434595 6.0167006568 8.424925e-05
Now we have lift and theta for each term and each topic. We can plot a wordcloud for each topic in which a term is included if its lift is above 1, and its size is proportional to its term probability. These wordclouds will give us an idea of each latent topic. Let's plot the top 100 terms in each topic.
for (i in 1:K){ # For each topic
  a0 = which(lift[,i] > 1) # terms with lift greater than 1 for topic i
  freq = theta[a0, i] # theta values for those terms
  freq = sort(freq, decreasing = T) # terms with higher probabilities for topic i first
  # Auto-correction - sometimes fewer than 100 terms in a topic have lift above 1
  n = ifelse(length(freq) >= 100, 100, length(freq))
  top_word = as.matrix(freq[1:n])
  # Plot wordcloud
  wordcloud(rownames(top_word), top_word, scale = c(4, 0.5), 1,
            random.order = FALSE, random.color = FALSE,
            colors = brewer.pal(8, "Dark2"))
  mtext(paste("Latent Topic", i), side = 3, line = 2, cex = 2)
}
To get a clearer picture of the topics, let's plot a co-occurrence graph of each topic's top 20 terms. As we did for the topic wordclouds, we first find the top terms for a topic and then construct a co-occurrence matrix. Once the co-occurrence matrix is constructed, we can plot it using the graph.adjacency function from the igraph package. Note that for better readability we censor this matrix, keeping only each term's top 2 edges.
for (i in 1:K){ # For each topic
  a0 = which(lift[,i] > 1) # terms with lift greater than 1 for topic i
  freq = theta[a0, i] # theta values for those terms
  freq = sort(freq, decreasing = T) # terms with higher probabilities for topic i first
  # Auto-correction - sometimes fewer than 20 terms in a topic have lift above 1
  n = ifelse(length(freq) >= 20, 20, length(freq))
  top_word = as.matrix(freq[1:n])
  # Now, for the top 20 words, extract the relevant columns of the Document Term Matrix
  mat = dtm1[, match(row.names(top_word), colnames(dtm1))]
  mat = as.matrix(mat)
  cmat = t(mat) %*% (mat) # term-by-term co-occurrence counts
  diag(cmat) = 0
  # Let's limit the number of connections to the top 2 per term
  for (p in 1:nrow(cmat)){
    vec = cmat[p,]
    cutoff = sort(vec, decreasing = T)[2]
    cmat[p,][cmat[p,] < cutoff] = 0
  }
  # cmat[cmat < quantile(cmat, .80)] = 0
  graph <- graph.adjacency(cmat, mode = "undirected", weighted = T)
  plot(graph, # the graph to be plotted
       layout = layout.fruchterman.reingold, # the layout method
       vertex.frame.color = 'blue', # the color of the border of the dots
       vertex.label.color = 'black', # the color of the name labels
       vertex.label.font = 1, # the font of the name labels
       vertex.size = .00001, # dot size
       vertex.label.cex = 1.3)
  mtext(paste("Topic", i), side = 3, line = 2, cex = 2)
}
Now that we have the lift matrix and the DocumentTermMatrix, we can create a weighting scheme for each document and each topic, which gives the proportion of a topic in a document. Here I first define a function and then call it to calculate topic proportions in the documents.
eta = function(mat, dtm) {
  mat1 = mat/mean(mat); terms1 = rownames(mat1)
  eta.mat = matrix(0, 1, ncol(mat1))
  for (i in 1:nrow(dtm)){
    a11 = as.data.frame(matrix(dtm[i,])) # term counts for document i
    rownames(a11) = colnames(dtm)
    a12 = as.matrix(a11[(a11 > 0),]) # keep only the terms that occur in document i
    rownames(a12) = rownames(a11)[(a11 > 0)]
    a13 = intersect(terms1, rownames(a12)) # terms common to the lift matrix and the document
    a14a = match(a13, terms1) # positions of matching terms in mat1 matrix
    a14b = match(a13, rownames(a12)) # positions of matching terms in the document
    a15 = mat1[a14a,]*matrix(rep(a12[a14b,],
                                 ncol(mat1)),
                             ncol = ncol(mat1)) # term counts times normalized lift
    eta.mat = rbind(eta.mat, apply(a15, 2, mean)) # average over terms, one value per topic
    rm(a11, a12, a13, a14a, a14b, a15)
  }
  eta.mat = eta.mat[2:nrow(eta.mat), ] # remove top zeroes row
  row.names(eta.mat) = row.names(dtm)
  return(eta.mat)
}
twc = eta(lift, dtm1)
head(twc)
## 1 2 3
## 1 2.444581 0.01068431 0.6390536
## 2 1.687751 0.01428670 0.4558064
## 3 2.539287 0.01710608 0.5575338
## 4 1.226948 0.31564554 4.6157593
## 5 2.248299 0.01565627 0.7026517
## 6 1.931055 0.99765525 0.8907726
Now that we have topic proportions for each document, we can find the top documents loading on a topic and read them for a better interpretation of the topics.
Here I first define a function which sorts the twc matrix in decreasing order and picks the names of the top n (n = 5) documents. Then I call this function with the required arguments and print the company names for each topic.
eta.file.name = function(mat, calib, n) {
  s = list() # Blank list
  for (i in 1:ncol(mat)) # For each topic
  {
    read_doc = mat[order(mat[,i], decreasing = T),] # Sort document proportion matrix (twc)
    read_names = row.names(read_doc[1:n,]) # Document indices of the first n documents
    s[[i]] = calib[as.numeric(read_names), 1] # Store the first n company names in the list
  }
  return(s)
}
temp1 = eta.file.name(twc,textdata,5)
for (i in 1:length(temp1)){
print(paste('Companies loading heavily on topic',i,'are'))
print(temp1[[i]])
print('--------------------------')
}
## [1] "Companies loading heavily on topic 1 are"
## [1] "APPLIED MATERIALS INC" "ADVANCED MICRO DEVICES"
## [3] "TELEPHONE & DATA SYSTEMS INC" "ADOBE SYSTEMS INC"
## [5] "CENTURYLINK INC"
## [1] "--------------------------"
## [1] "Companies loading heavily on topic 2 are"
## [1] "QUALCOMM INC" "CORNING INC"
## [3] "TRIMBLE NAVIGATION LTD" "HARRIS CORP"
## [5] "ECHOSTAR CORP"
## [1] "--------------------------"
## [1] "Companies loading heavily on topic 3 are"
## [1] "ADOBE SYSTEMS INC" "DST SYSTEMS INC"
## [3] "AUTOMATIC DATA PROCESSING" "BROADRIDGE FINANCIAL SOLUTNS"
## [5] "IAC/INTERACTIVECORP"
## [1] "--------------------------"
Similarly, we can find the top text documents for each topic. Since these documents are very large, I am not printing them here. You can uncomment the code below and print the documents to read them as per your requirement.
eta.file = function(mat, calib, n) {
  s = list() # Blank list
  for (i in 1:ncol(mat)) # For each topic
  {
    read_doc = mat[order(mat[,i], decreasing = T),] # Sort document proportion matrix (twc)
    read_names = row.names(read_doc[1:n,]) # Document indices of the first n documents
    s[[i]] = calib[as.numeric(read_names), 2] # Store the first n document texts in the list
  }
  return(s)
}
temp2 = eta.file(twc,textdata,5)
# for (i in 1:length(temp2)){
# print(paste('Documents loading heavily on topic',i,'are'))
# print(temp2[[i]])
# print('--------------------------')
# }
Now, based on the wordclouds, document content, and company names, we have interpreted and labelled the topics.
Business Description: IT Services & Solutions. Justification: With K = 3, we see that services and solutions around information technology and software are the focus area of the business.