In this article we will look at how to fit a Latent Topic model and interpret the topics. For illustration, I will demonstrate topic analysis based on the SEC 10-K filings of 20 randomly chosen Technology-sector firms from the Fortune 1000.
The first step in topic mining is to process the text data and create a Document Term Matrix. For your homework, we have processed the text data and provide the Document Term Matrix and the text data as R objects (.Rds), so that you can load them directly into your R workspace. For fitting and visualising a Latent Topic Model, we will need the following packages.
If you have not installed these packages, please install them using the command install.packages("package-name").
Clear the workspace and invoke required libraries
rm(list=ls()) # Clear the workspace
library("tm")
## Warning: package 'tm' was built under R version 3.2.3
## Warning: package 'NLP' was built under R version 3.2.3
library("wordcloud")
## Warning: package 'wordcloud' was built under R version 3.2.4
## Warning: package 'RColorBrewer' was built under R version 3.2.3
library("maptpx")
## Warning: package 'maptpx' was built under R version 3.2.4
## Warning: package 'slam' was built under R version 3.2.3
library("igraph")
## Warning: package 'igraph' was built under R version 3.2.4
textdata = readRDS("BD.Technology.rds")
dtm1 = readRDS("dtm1.BD.rds")
Next we (i.e., the user) have to input the number of latent topics we think there are in the corpus.
[Of course, a priori, we wouldn't know how many topics are really there. So we do trial-and-error: we choose K = 2, then 3, then 4, and so on till the results look reasonable and interpretable.]
After trying 2, 3, and 4 topics, we decided on K = 3. So ours is a 3-topic latent model.
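As an optional aside, the maptpx package can do part of this search for us: according to its documentation, topics() accepts a vector of K values, fits each one, and returns the model for the K with the highest Bayes factor against a single-topic null model. The sketch below is commented out since it refits the model several times; you can uncomment it if you want to try it. For the rest of this article we simply proceed with K = 3.
# Optional sketch: Bayes-factor search over candidate K values (refits the model for each K)
# ksel = topics(dtm1, K = 2:4, verb = 1)
# ksel$K  # the number of topics selected via the Bayes factor
# ksel$BF # log Bayes factors for the candidate K values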
K = 3 # Choose number of topics in the model
simfit = topics(dtm1, K = K, verb = 2) # Fit the K topic model
##
## Estimating on a 85 document collection.
## Fitting the 3 topic model.
## log posterior increase: 4189.1, 1706.6, done. (L = -2397182.5)
summary(simfit, nwrd = 12) # Summary of simfit model
##
## Top 12 phrases by topic-over-null term lift (and usage %):
##
## [1] 'adhesion', 'advantedge', 'appliedmaterials', 'baccini', 'centura', 'chipmaking', 'dpn', 'duv', 'endura', 'epitaxy', 'equipment-services', 'eterna' (69)
## [2] 'teradata-division', 'aster', 'mapreduce', 'joining-teradata', 'mcdonald', 'langos', 'nyquist', 'platform-family', 'wimmer', 'blanton', 'koehler', 'scheppmann' (16.5)
## [3] 'lien', 'prepares', 'protests', 'reynolds', 'submission', 'ira', 'trail', 'backgrounds', 'compile', 'custodian', 'custodians', 'depository' (14.5)
##
## Dispersion = 2.64
Suppose there are D documents and T term-tokens in the corpus, and we are fitting K topics. The topic model then gives us two main outputs:
One, a \(\theta\) matrix of term probabilities, which tells us, for each term, the probability that the term belongs to each topic. Its dimension is T x K.
Two, an \(\omega\) document-composition matrix, which gives the probability mass distribution of topic proportions within each document. Its dimension is D x K.
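As a quick sanity check, we can confirm these dimensions directly on the fitted object (a minimal sketch):
dim(simfit$theta) # T x K: number of terms by number of topics
dim(simfit$omega) # D x K: number of documents by number of topics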
E.g., let’s view the term probability matrix \(\theta\):
simfit$theta[1:10,]
## topic
## phrase 1 2 3
## aberle 1.048846e-08 6.208698e-05 5.776778e-10
## abi-research 1.730882e-06 1.170353e-04 5.782557e-10
## abide 2.261884e-06 9.410876e-05 5.797659e-10
## abilities 1.456643e-07 8.067324e-05 2.574736e-05
## ability 1.885960e-03 2.030214e-07 1.385226e-03
## abroad 6.137769e-05 1.149882e-06 8.555536e-05
## absence 1.447981e-06 6.236133e-05 4.063358e-05
## absolute 2.796881e-07 1.106695e-04 3.830797e-05
## absorption 3.865326e-08 1.033893e-04 5.778734e-10
## abundant 3.727291e-08 4.126525e-05 5.778194e-10
Let's sort this matrix in decreasing order of total term probability and check the top few terms.
a0 = apply(simfit$theta, 1, sum);
a01 = order(a0, decreasing = TRUE)
simfit$theta[a01[1:10],]
## topic
## phrase 1 2 3
## services 6.822479e-03 2.624353e-06 1.160616e-02
## solutions 3.084437e-03 8.406747e-05 9.128283e-03
## products 1.091979e-02 3.111718e-07 3.106755e-04
## business 4.307310e-03 7.076063e-06 5.805670e-03
## customers 9.436238e-03 3.148863e-07 3.064247e-08
## company 4.016365e-03 6.067617e-07 4.320089e-03
## clients 6.240051e-05 4.951274e-09 8.055256e-03
## provide 4.369962e-03 6.308906e-07 3.116367e-03
## information 2.560277e-03 2.780112e-06 4.616670e-03
## content 6.315863e-04 2.635999e-05 6.259498e-03
Here you can see that general business terms such as 'services', 'products', and 'business' carry most of their probability mass in topics 1 and 3, with only tiny probabilities in topic 2. Similarly, we can view the \(\omega\) matrix for documents.
simfit$omega[1:10,]
## topic
## document 1 2 3
## 1 0.99981305 6.693992e-05 0.0001200120
## 2 0.99963670 1.353498e-04 0.0002279463
## 3 0.99983087 6.838078e-05 0.0001007477
## 4 0.04862695 1.473530e-04 0.9512256930
## 5 0.99977716 7.502209e-05 0.0001478197
## 6 0.88745625 1.115747e-01 0.0009690774
## 7 0.64201252 1.828276e-01 0.1751598777
## 8 0.68601405 3.137507e-01 0.0002352900
## 9 0.70572260 9.754706e-02 0.1967303405
## 10 0.66359037 1.090356e-01 0.2273740408
We can say that documents 1, 2, and 3 load almost entirely on topic 1, document 4 loads heavily on topic 3, and documents such as 7 through 10 are mixtures of the topics.
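As a quick summary of the whole \(\omega\) matrix, we can also tabulate which topic dominates each document (a minimal sketch using the simfit object fitted above):
dominant = apply(simfit$omega, 1, which.max) # index of the highest-proportion topic per document
table(dominant) # number of documents dominated by each topic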
Some terms have high frequency, others low. We want to ensure that raw term frequency does not unduly influence topic weights, so we normalize it into a metric called 'lift'.
The lift of a term for a topic is its topic-membership probability normalized by its occurrence probability in the corpus: \(\text{lift}(t,k) = \theta_{t,k} / p(t)\), where \(p(t)\) is the term's share of all token occurrences. If a term's lift for a topic is high, then that term is useful in characterising that topic.
Since the topics() function doesn't return a lift matrix for terms, we can calculate the lift of each term ourselves.
Depending on the number of terms in the DocumentTermMatrix, the lift calculation may take some time; in your case it should complete in 2-3 minutes.
t = Sys.time()
theta = simfit$theta
lift = theta*0; sum1 = sum(dtm1) # sum1 = total number of token occurrences in the corpus
for (i in 1:nrow(theta)){
  for (j in 1:ncol(theta)){
    ptermtopic = theta[i, j] # term i's probability of topic j membership
    pterm = sum(dtm1[,i])/sum1 # marginal probability of term i's occurrence in the corpus
    lift[i, j] = ptermtopic/pterm # lift = topic membership probability normalized by occurrence probability
  }
}
Sys.time() - t # Total time for calculating lift
## Time difference of 4.755205 mins
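As an aside, the same lift matrix can be computed in one vectorized step, avoiding the double loop. This is only a sketch; it assumes the rows of theta and the columns of dtm1 refer to the same terms (we align them by name) and uses col_sums() from the slam package, which is loaded alongside tm.
pterm = slam::col_sums(dtm1)[rownames(theta)]/sum(dtm1) # marginal probability of each term
lift2 = theta/pterm # divides each row of theta by that term's marginal probability
max(abs(lift2 - lift)) # should be (near) zero if it matches the loop above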
Let's print the lift for the first 10 terms. [Remember, lift has the same dimensions as the \(\theta\) matrix.]
lift[1:10,]
## topic
## phrase 1 2 3
## aberle 0.001019517 6.0350821306 5.615240e-05
## abi-research 0.084124021 5.6881283121 2.810429e-05
## abide 0.131918071 5.4886300011 3.381322e-05
## abilities 0.008495463 4.7050410163 1.501643e+00
## ability 1.255631905 0.0001351673 9.222535e-01
## abroad 1.118650555 0.0209573890 1.559305e+00
## absence 0.084449428 3.6370497365 2.369840e+00
## absolute 0.011651448 4.6103491280 1.595861e+00
## absorption 0.002254343 6.0298887416 3.370285e-05
## abundant 0.005434595 6.0167006568 8.424925e-05
Now we have lift and theta for each term and each topic. We can plot a wordcloud for each topic in which a term is included if its lift is above 1, and its size is proportional to its term probability. These wordclouds will give us an idea of each latent topic. Let's plot the top 100 terms in each topic.
for (i in 1:K){ # For each topic
  a0 = which(lift[,i] > 1) # terms with lift greater than 1 for topic i
  freq = theta[a0, i] # theta values for those terms
  freq = sort(freq, decreasing = T) # terms with higher probabilities for topic i first
  # Auto-correction - sometimes fewer than 100 terms in a topic have lift above 1
  n = ifelse(length(freq) >= 100, 100, length(freq))
  top_word = as.matrix(freq[1:n])
  # Plot wordcloud
  wordcloud(rownames(top_word), top_word, scale = c(4, 0.5), 1,
            random.order = FALSE, random.color = FALSE,
            colors = brewer.pal(8, "Dark2"))
  mtext(paste("Latent Topic", i), side = 3, line = 2, cex = 2)
}
To get a clearer picture of the topics, let's plot a co-occurrence graph of each topic's top 20 terms. As we did for the topic wordclouds, we first find the top terms for a topic and then construct a co-occurrence matrix. Once the co-occurrence matrix is constructed, we can plot it using the graph.adjacency function from the igraph package. Note that for better readability we censor this matrix, keeping only each term's top 2 edges.
for (i in 1:K){ # For each topic
  a0 = which(lift[,i] > 1) # terms with lift greater than 1 for topic i
  freq = theta[a0, i] # theta values for those terms
  freq = sort(freq, decreasing = T) # terms with higher probabilities for topic i first
  # Auto-correction - sometimes fewer than 20 terms in a topic have lift above 1
  n = ifelse(length(freq) >= 20, 20, length(freq))
  top_word = as.matrix(freq[1:n])
  # Now, for the top 20 words, extract the relevant columns of the Document Term Matrix
  mat = dtm1[, match(row.names(top_word), colnames(dtm1))]
  mat = as.matrix(mat)
  cmat = t(mat) %*% (mat) # term-by-term co-occurrence counts
  diag(cmat) = 0
  # Let's limit the number of connections to the top 2 per term
  for (p in 1:nrow(cmat)){
    vec = cmat[p,]
    cutoff = sort(vec, decreasing = T)[2]
    cmat[p,][cmat[p,] < cutoff] = 0
  }
  # cmat[cmat < quantile(cmat, .80)] = 0
  graph <- graph.adjacency(cmat, mode = "undirected", weighted = T)
  plot(graph, # the graph to be plotted
       layout = layout.fruchterman.reingold, # the layout method
       vertex.frame.color = 'blue', # the color of the border of the dots
       vertex.label.color = 'black', # the color of the name labels
       vertex.label.font = 1, # the font of the name labels
       vertex.size = .00001, # dot size
       vertex.label.cex = 1.3)
  mtext(paste("Topic", i), side = 3, line = 2, cex = 2)
}
Now that we have the lift matrix and the DocumentTermMatrix, we can create a weighting scheme for each document and each topic, which gives the proportion of a topic in a document. Here I first define a function and then call it to calculate topic proportions in the documents.
eta = function(mat, dtm) {
  mat1 = mat/mean(mat); terms1 = rownames(mat1)
  eta.mat = matrix(0, 1, ncol(mat1))
  for (i in 1:nrow(dtm)){
    a11 = as.data.frame(matrix(dtm[i,])) # term counts for document i
    rownames(a11) = colnames(dtm)
    a12 = as.matrix(a11[(a11 > 0),]) # keep only the terms that occur in document i
    rownames(a12) = rownames(a11)[(a11 > 0)]
    a13 = intersect(terms1, rownames(a12)) # terms common to the lift matrix and the document
    a14a = match(a13, terms1) # positions of matching terms in mat1 matrix
    a14b = match(a13, rownames(a12)) # positions of matching terms in the document
    a15 = mat1[a14a,]*matrix(rep(a12[a14b,],
                                 ncol(mat1)),
                             ncol = ncol(mat1)) # term counts times normalized lift
    eta.mat = rbind(eta.mat, apply(a15, 2, mean)) # average over terms, one value per topic
    rm(a11, a12, a13, a14a, a14b, a15)
  }
  eta.mat = eta.mat[2:nrow(eta.mat), ] # remove top zeroes row
  row.names(eta.mat) = row.names(dtm)
  return(eta.mat)
}
twc = eta(lift, dtm1)
head(twc)
## 1 2 3
## 1 2.444581 0.01068431 0.6390536
## 2 1.687751 0.01428670 0.4558064
## 3 2.539287 0.01710608 0.5575338
## 4 1.226948 0.31564554 4.6157593
## 5 2.248299 0.01565627 0.7026517
## 6 1.931055 0.99765525 0.8907726
Now that we have topic proportions for each document, we can find the top documents loading on a topic and read them for a better interpretation of the topics.
Here I first define a function which sorts the twc matrix in decreasing order and picks the names of the top n (n = 5) documents. Then I call this function with the required arguments and print the company names for each topic.
eta.file.name = function(mat, calib, n) {
  s = list() # Blank list
  for (i in 1:ncol(mat)) # For each topic
  {
    read_doc = mat[order(mat[,i], decreasing = T),] # Sort document proportion matrix (twc)
    read_names = row.names(read_doc[1:n,]) # Document indices of the first n documents
    s[[i]] = calib[as.numeric(read_names), 1] # Store the first n company names in the list
  }
  return(s)
}
temp1 = eta.file.name(twc,textdata,5)
for (i in 1:length(temp1)){
print(paste('Companies loading heavily on topic',i,'are'))
print(temp1[[i]])
print('--------------------------')
}
## [1] "Companies loading heavily on topic 1 are"
## [1] "APPLIED MATERIALS INC" "ADVANCED MICRO DEVICES"
## [3] "TELEPHONE & DATA SYSTEMS INC" "ADOBE SYSTEMS INC"
## [5] "CENTURYLINK INC"
## [1] "--------------------------"
## [1] "Companies loading heavily on topic 2 are"
## [1] "QUALCOMM INC" "CORNING INC"
## [3] "TRIMBLE NAVIGATION LTD" "HARRIS CORP"
## [5] "ECHOSTAR CORP"
## [1] "--------------------------"
## [1] "Companies loading heavily on topic 3 are"
## [1] "ADOBE SYSTEMS INC" "DST SYSTEMS INC"
## [3] "AUTOMATIC DATA PROCESSING" "BROADRIDGE FINANCIAL SOLUTNS"
## [5] "IAC/INTERACTIVECORP"
## [1] "--------------------------"
Similarly, we can find the top text documents for each topic. Since these documents are very large, I am not printing them here. You can uncomment the code below and print the documents to read them as per your requirement.
eta.file = function(mat, calib, n) {
  s = list() # Blank list
  for (i in 1:ncol(mat)) # For each topic
  {
    read_doc = mat[order(mat[,i], decreasing = T),] # Sort document proportion matrix (twc)
    read_names = row.names(read_doc[1:n,]) # Document indices of the first n documents
    s[[i]] = calib[as.numeric(read_names), 2] # Store the first n document texts in the list
  }
  return(s)
}
temp2 = eta.file(twc,textdata,5)
# for (i in 1:length(temp2)){
# print(paste('Documents loading heavily on topic',i,'are'))
# print(temp2[[i]])
# print('--------------------------')
# }
Now, based on the wordclouds, document content, and company names, we have interpreted and labelled the topics.
Business Description: IT Services & Solutions. Justification: With K = 3, we see that services and solutions around information technology and software are the focus area of the business.