In this article we will look at how to fit a Latent Topic model and interpret the resulting topics. For illustration, I will demonstrate topic analysis on the SEC 10-K filings of 20 randomly chosen firms in the Technology sector from the Fortune 1000.

The first step in topic mining is to process the text data and create a Document Term Matrix (DTM). For your homework, we have already processed the text data and will provide the Document Term Matrix and the text data as R objects (.Rds), so you can load them directly into your R workspace. For fitting and visualising a Latent Topic Model, we will need four packages: tm, wordcloud, maptpx and igraph.

If you have not installed these packages, please install them using the command install.packages("package-name").
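For example, all four packages used below can be installed in one call (you only need to do this once per machine):

install.packages(c("tm", "wordcloud", "maptpx", "igraph"))  # skip if already installed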

Clear the workspace and invoke required libraries

rm(list=ls()) # Clear the workspace

library("tm")
library("wordcloud")
library("maptpx")
library("igraph")

Now read the text data and the DocumentTermMatrix (DTM) into your R session with the readRDS function. textdata is a data frame with 2 columns: the first column is the company name and the second column is the text of Item 1 (Business) from the 10-K filing. The DTM was created from this second column.

textdata = readRDS("BD.Technology.rds") 
dtm1 = readRDS("dtm1.BD.rds")

Next we (i.e., the user) have to input the number of latent topics we think there are in the corpus.

Of course, a priori, we wouldn't know how many topics are really there. So we do trial-and-error: we choose K = 2, then 3, then 4, and so on, until the results look reasonable and interpretable.

Trying the number of topics as 2, 3 and 4, we decided on K = 3. So ours is a 3-topic latent model.
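If you prefer a more formal check than eyeballing the results, maptpx can fit several candidate values of K in a single call and select one by the log Bayes factor. The snippet below is only a sketch; it assumes topics() accepts a vector for K (as documented in maptpx) and that the returned object exposes the selected K and the Bayes factors.

K_cand = 2:4
fit_sel = topics(dtm1, K = K_cand, verb = 1)  # fits each candidate K, keeps the best by log Bayes factor
fit_sel$K    # number of topics actually selected
fit_sel$BF   # log Bayes factor for each candidate K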

K = 3 # Choose number of topics in the model
simfit = topics(dtm1,  K = K, verb = 2) # Fit the K topic model
## 
## Estimating on a 85 document collection.
## Fitting the 3 topic model.
## log posterior increase: 4189.1, 1706.6, done. (L = -2397182.5)
summary(simfit, nwrd = 12)  # Summary of simfit model
## 
## Top 12 phrases by topic-over-null term lift (and usage %):
## 
## [1] 'adhesion', 'advantedge', 'appliedmaterials', 'baccini', 'centura', 'chipmaking', 'dpn', 'duv', 'endura', 'epitaxy', 'equipment-services', 'eterna' (69) 
## [2] 'teradata-division', 'aster', 'mapreduce', 'joining-teradata', 'mcdonald', 'langos', 'nyquist', 'platform-family', 'wimmer', 'blanton', 'koehler', 'scheppmann' (16.5) 
## [3] 'lien', 'prepares', 'protests', 'reynolds', 'submission', 'ira', 'trail', 'backgrounds', 'compile', 'custodian', 'custodians', 'depository' (14.5) 
## 
## Dispersion = 2.64

Suppose there are D documents and T term-tokens in the corpus, and we are fitting K topics. Then the topic model gives us two main outputs:

One, a \(\theta\) matrix of term probabilities, which tells us, for each term, the probability that the term belongs to each topic. Its dimension is T x K.

Two, an \(\omega\) document-composition matrix, which gives the probability mass distribution of topic proportions in each document. Its dimension is D x K.
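A quick sanity check on these two objects (a minimal sketch using the simfit model fitted above): each column of \(\theta\) and each row of \(\omega\) should sum to 1.

dim(simfit$theta)       # T x K: one row per term, one column per topic
colSums(simfit$theta)   # each topic's term distribution sums to 1
dim(simfit$omega)       # D x K: one row per document
rowSums(simfit$omega)   # each document's topic proportions sum to 1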

E.g., let’s view the term probability matrix \(\theta\):

simfit$theta[1:10,]
##               topic
## phrase                    1            2            3
##   aberle       1.048846e-08 6.208698e-05 5.776778e-10
##   abi-research 1.730882e-06 1.170353e-04 5.782557e-10
##   abide        2.261884e-06 9.410876e-05 5.797659e-10
##   abilities    1.456643e-07 8.067324e-05 2.574736e-05
##   ability      1.885960e-03 2.030214e-07 1.385226e-03
##   abroad       6.137769e-05 1.149882e-06 8.555536e-05
##   absence      1.447981e-06 6.236133e-05 4.063358e-05
##   absolute     2.796881e-07 1.106695e-04 3.830797e-05
##   absorption   3.865326e-08 1.033893e-04 5.778734e-10
##   abundant     3.727291e-08 4.126525e-05 5.778194e-10

Let's sort this matrix in decreasing order of total term probability and check a few of the top terms.

a0 = apply(simfit$theta, 1, sum); 
a01 = order(a0, decreasing = TRUE)
simfit$theta[a01[1:10],]
##              topic
## phrase                   1            2            3
##   services    6.822479e-03 2.624353e-06 1.160616e-02
##   solutions   3.084437e-03 8.406747e-05 9.128283e-03
##   products    1.091979e-02 3.111718e-07 3.106755e-04
##   business    4.307310e-03 7.076063e-06 5.805670e-03
##   customers   9.436238e-03 3.148863e-07 3.064247e-08
##   company     4.016365e-03 6.067617e-07 4.320089e-03
##   clients     6.240051e-05 4.951274e-09 8.055256e-03
##   provide     4.369962e-03 6.308906e-07 3.116367e-03
##   information 2.560277e-03 2.780112e-06 4.616670e-03
##   content     6.315863e-04 2.635999e-05 6.259498e-03

Here you can see that high-frequency terms like products, services and solutions have high probabilities under topics 1 and 3, and very low probability under topic 2. Similarly, we can look at the \(\omega\) matrix for documents.

simfit$omega[1:10,]
##         topic
## document          1            2            3
##       1  0.99981305 6.693992e-05 0.0001200120
##       2  0.99963670 1.353498e-04 0.0002279463
##       3  0.99983087 6.838078e-05 0.0001007477
##       4  0.04862695 1.473530e-04 0.9512256930
##       5  0.99977716 7.502209e-05 0.0001478197
##       6  0.88745625 1.115747e-01 0.0009690774
##       7  0.64201252 1.828276e-01 0.1751598777
##       8  0.68601405 3.137507e-01 0.0002352900
##       9  0.70572260 9.754706e-02 0.1967303405
##       10 0.66359037 1.090356e-01 0.2273740408

We can say Documents 1, 2 and 3 load heavily on topic 1, whereas Document 4 loads heavily on topic 3. Documents 7, 9 and 10 are mixes of all three topics.
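To make the \(\omega\) matrix easier to read, we can attach the company names from textdata (a small sketch; it assumes the rows of simfit$omega are in the same order as the rows of textdata):

omega_named = data.frame(company = textdata[, 1], round(simfit$omega, 3))  # label rows with company names
omega_named[1:10, ]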

Some terms have high frequency, others have low frequency. We want to ensure that raw term frequency does not unduly influence topic weights, so we normalize term frequency in a metric called 'lift'.

The lift of a term is its topic-membership probability normalized by its occurrence probability, i.e. \(\text{lift}_{t,k} = \theta_{t,k} / p_t\), where \(p_t\) is the marginal probability of term t in the corpus. If the lift of a term for a topic is high, that term is useful in characterising that topic.

Since the topics function doesn't return a lift matrix for terms, we can write a simple function to calculate the lift of each term.

Depending on the number of terms in the DocumentTermMatrix, the lift calculation may take some time. In your case it should complete in 2-3 minutes.

t = Sys.time()
theta = simfit$theta
lift = theta*0;  sum1 = sum(dtm1)
for (i in 1:nrow(theta)){  
  for (j in 1:ncol(theta)){
    ptermtopic = theta[i, j]        # term i's probability of membership in topic j
    pterm = sum(dtm1[,i])/sum1      # marginal probability of term i's occurrence in the corpus
    lift[i, j] = ptermtopic/pterm   # lift = topic membership probability / occurrence probability
  }
}
Sys.time() - t # Total time for calculating lift
## Time difference of 4.755205 mins
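The double loop above is easy to read but slow. A vectorized version (a sketch, relying on the same assumption the loop makes, namely that row i of theta corresponds to column i of dtm1) gives the same matrix almost instantly:

pterm2 = slam::col_sums(dtm1)/sum(dtm1)   # marginal occurrence probability of every term at once
lift2 = theta/pterm2                      # divides row i of theta by pterm2[i]
max(abs(lift2 - lift))                    # should be (numerically) zero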

Let's print the lift for the first 10 terms. Remember, lift has the same dimensions as the \(\theta\) matrix.

lift[1:10,]
##               topic
## phrase                   1            2            3
##   aberle       0.001019517 6.0350821306 5.615240e-05
##   abi-research 0.084124021 5.6881283121 2.810429e-05
##   abide        0.131918071 5.4886300011 3.381322e-05
##   abilities    0.008495463 4.7050410163 1.501643e+00
##   ability      1.255631905 0.0001351673 9.222535e-01
##   abroad       1.118650555 0.0209573890 1.559305e+00
##   absence      0.084449428 3.6370497365 2.369840e+00
##   absolute     0.011651448 4.6103491280 1.595861e+00
##   absorption   0.002254343 6.0298887416 3.370285e-05
##   abundant     0.005434595 6.0167006568 8.424925e-05

Now we have lift and theta for each term and each topic. We can plot a wordcloud for each topic in which a term is included if its lift is above 1 and its size is proportional to its term probability. These wordclouds give us an idea of each latent topic. Let's plot the top 100 terms in each topic.

for (i in 1:K){       # For each topic 
a0 = which(lift[,i] > 1)         # terms with lift greater than 1 for topic i
freq = theta[a0,i]               # theta for those terms
freq = sort(freq,decreasing = T) # terms with higher probabilities for topic i

# Auto-correction: sometimes a topic has fewer than 100 terms with lift above 1
n = ifelse(length(freq) >= 100, 100, length(freq))
top_word = as.matrix(freq[1:n])

# Plot wordcloud
wordcloud(rownames(top_word), top_word,  scale=c(4,0.5), 1,
          random.order=FALSE, random.color=FALSE, 
          colors=brewer.pal(8, "Dark2"))
mtext(paste("Latent Topic",i), side = 3, line = 2, cex=2)
}

To get a clearer picture of the topics, let's plot a co-occurrence graph of the top 20 terms. As with the topic wordclouds, we first find the top terms for a topic and then construct a co-occurrence matrix. Once the co-occurrence matrix is constructed, we can plot it with the graph.adjacency function from the igraph package. Note that for better readability we censor this matrix, keeping only each term's top 2 edges.

for (i in 1:K){       # For each topic 
a0 = which(lift[,i] > 1)         # terms with lift greater than 1 for topic i
freq = theta[a0,i]               # theta for those terms
freq = sort(freq,decreasing = T) # terms with higher probabilities for topic i

# Auto-correction: sometimes a topic has fewer than 20 terms with lift above 1
n = ifelse(length(freq) >= 20, 20, length(freq))
top_word = as.matrix(freq[1:n])

# Restrict the Document Term Matrix to these top 20 terms
mat  = dtm1[,match(row.names(top_word),colnames(dtm1))]

mat = as.matrix(mat)
cmat  = t(mat) %*% (mat)
diag(cmat) = 0

# Let's limit number of connections to 2
for (p in 1:nrow(cmat)){
  vec = cmat[p,]
  cutoff = sort(vec, decreasing = T)[2]
  cmat[p,][cmat[p,] < cutoff] = 0
}

#cmat[cmat <  quantile(cmat,.80)] = 0

graph <- graph.adjacency(cmat, mode = "undirected",weighted=T)

plot(graph,         #the graph to be plotted
 layout=layout.fruchterman.reingold,    # the layout method. 
 vertex.frame.color='blue',         #the color of the border of the dots 
 vertex.label.color='black',        #the color of the name labels
 vertex.label.font=1,           #the font of the name labels
 vertex.size = .00001,   # Dots size
 vertex.label.cex=1.3)
 mtext(paste("Topic",i), side = 3, line = 2, cex=2)
}

Now we have the lift matrix and also the DocumentTermMatrix, so we can create a weighting scheme for each document and each topic, which gives the proportion of a topic in a document. Below I first define a function and then call it to calculate the topic proportions in the documents.

eta = function(mat, dtm) {

  mat1 = mat/mean(mat);  terms1 = rownames(mat1)   # lift matrix scaled by its mean
  eta.mat = matrix(0, 1, ncol(mat1))               # seed row of zeros, dropped at the end

  for (i in 1:nrow(dtm)){
    a11 = as.data.frame(matrix(dtm[i,]))           # term counts for document i
    rownames(a11) = colnames(dtm)
    a12 = as.matrix(a11[(a11>0),])                 # keep only terms that occur in document i
    rownames(a12) = rownames(a11)[(a11>0)]
    a13 = intersect(terms1, rownames(a12))         # terms common to the lift matrix and document i
    a14a = match(a13, terms1)                      # positions of matching terms in mat1 matrix
    a14b = match(a13, rownames(a12))               # positions of matching terms in document i
    a15 = mat1[a14a,]*matrix(rep(a12[a14b,],       # scaled lift weighted by term frequency in document i
                                 ncol(mat1)), 
                             ncol = ncol(mat1))
    eta.mat = rbind(eta.mat, apply(a15, 2, mean))  # average over terms gives topic scores for document i
    rm(a11, a12, a13, a14a, a14b, a15)
  }
  eta.mat = eta.mat[2:nrow(eta.mat), ]  # remove top zeroes row
  row.names(eta.mat) = row.names(dtm)
  return(eta.mat)
}

twc = eta(lift, dtm1)
head(twc)
##          1          2         3
## 1 2.444581 0.01068431 0.6390536
## 2 1.687751 0.01428670 0.4558064
## 3 2.539287 0.01710608 0.5575338
## 4 1.226948 0.31564554 4.6157593
## 5 2.248299 0.01565627 0.7026517
## 6 1.931055 0.99765525 0.8907726
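The rows of twc are un-normalized topic scores. To read them as the proportion of each topic within a document, we can normalize each row to sum to 1 (a small sketch using the twc object just computed):

twc_prop = twc/rowSums(twc)   # each row now sums to 1
round(head(twc_prop), 3)      # topic proportions for the first few documents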

Now that we have the topic proportions for each document, we can find the top documents loading on a topic and read them for a better interpretation of the topics.

Below I first define a function which sorts the twc matrix in decreasing order and picks the names of the top n (here n = 5) documents. I then call this function with the required arguments and print the company names for each topic.

eta.file.name = function(mat,calib,n) {
  s = list()                   # Blank list
  for (i in  1: ncol(mat))     # For each topic
  {
    read_doc = mat[order(mat[,i], decreasing= T),]  # Sort document proportion matrix (twc)
    read_names = row.names(read_doc[1:n,])          # document indices for the first n documents
    s[[i]] = calib[as.numeric(read_names),1]     # Store the first n company names in the list  
      }
  return(s)
}

temp1 = eta.file.name(twc,textdata,5)

for (i in 1:length(temp1)){
  print(paste('Companies loading heavily on topic',i,'are'))
  print(temp1[[i]])
  print('--------------------------')
}
## [1] "Companies loading heavily on topic 1 are"
## [1] "APPLIED MATERIALS INC"        "ADVANCED MICRO DEVICES"      
## [3] "TELEPHONE & DATA SYSTEMS INC" "ADOBE SYSTEMS INC"           
## [5] "CENTURYLINK INC"             
## [1] "--------------------------"
## [1] "Companies loading heavily on topic 2 are"
## [1] "QUALCOMM INC"           "CORNING INC"           
## [3] "TRIMBLE NAVIGATION LTD" "HARRIS CORP"           
## [5] "ECHOSTAR CORP"         
## [1] "--------------------------"
## [1] "Companies loading heavily on topic 3 are"
## [1] "ADOBE SYSTEMS INC"            "DST SYSTEMS INC"             
## [3] "AUTOMATIC DATA PROCESSING"    "BROADRIDGE FINANCIAL SOLUTNS"
## [5] "IAC/INTERACTIVECORP"         
## [1] "--------------------------"

Similarly, we can find the top text documents. Since these documents are very large, I am not printing them here. You can uncomment the code below and print the documents as per your requirement.

eta.file = function(mat,calib,n) {
  s = list()                   # Blank list
  for (i in  1: ncol(mat))     # For each topic
  {
    read_doc = mat[order(mat[,i], decreasing= T),]  # Sort document proportion matrix (twc)
    read_names = row.names(read_doc[1:n,])          # document indices for the first n documents
    s[[i]] = calib[as.numeric(read_names),2]     # Store the first n documents in the list  
      }
  return(s)
}

temp2 = eta.file(twc,textdata,5)

# for (i in 1:length(temp2)){
#   print(paste('Documents loading heavily on topic',i,'are'))
#   print(temp2[[i]])
#   print('--------------------------')
# }

Now, based on the wordclouds, the document content and the company names, we have interpreted and labelled the topics.

Business Description: IT Services & Solutions. Justification: With K = 3, we see that services and solutions around information technology and software are the focus area of the business.