In this article we will look at how to fit a Latent Topic Model and interpret the topics. For illustration, the topic analysis is based on the SEC 10-K filings of 20 randomly chosen firms in the Technology sector of the Fortune 1000.
The first step in topic mining is to process the text data and create a Document Term Matrix. For fitting and visualising a Latent Topic Model, we will need four packages.
If you have not installed these packages, install them with the command install.packages("package-name").
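For example, to install all four in one go:
install.packages(c("tm", "wordcloud", "maptpx", "igraph")) # run once if not already installed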
Clear the workspace and invoke required libraries
rm(list=ls()) # Clear the workspace
library("tm")
library("wordcloud")
library("maptpx")
library("igraph")
Let's read the documents textdata and dtm1. The usual text cleaning has already been applied to these data using tm functions such as removePunctuation, stripWhitespace, and others.
textdata <- readRDS("C:\\Users\\User\\Desktop\\RF.Technology.Rds")
dtm1<- readRDS("C:\\Users\\User\\Desktop\\dtm1.RF.Rds")
Let's create a Corpus to read the articles from the documents.
RFCorpus <- Corpus(VectorSource(as.character(textdata$RF.Technology.Rds)))
summary(RFCorpus)
## Length Class1 Class2 Mode
## 0 VCorpus Corpus list
Next we remove sparse terms from the Document Term Matrix: a term is dropped if it is absent from at least the given proportion of documents, so the resulting matrix contains only terms whose sparsity is below the chosen threshold (here 0.6).
dtm1 <- removeSparseTerms(dtm1, 0.6)
To decide how many topics there really are, we use trial and error: we fit K = 2, then 3, then 4, and so on, until the results look reasonable and interpretable. In our case, K = 3 gives the most reasonable and interpretable result.
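As a hedged alternative to pure trial and error, maptpx can fit a range of K values and compare them via log Bayes factors; a sketch, assuming the same dtm1:
simselect = topics(dtm1, K = 2:5, verb = 1) # fit each K; the best-scoring model is returned
simselect$K # the K selected by the highest Bayes factor
simselect$BF # log Bayes factors for each K tried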
K = 3 # Choose number of topics in the model
simfit = topics(dtm1, K = K, verb = 2) # Fit the K topic model
##
## Estimating on a 85 document collection.
## Fitting the 3 topic model.
## log posterior increase: 50.1, done. (L = -1152277.1)
summary(simfit, nwrd = 12) # Summary of simfit model
##
## Top 12 phrases by topic-over-null term lift (and usage %):
##
## [1] 'products', 'customers', 'materially', 'adversely-affected', 'intellectual-property', 'china', 'suppliers', 'manufacturing', 'adverse-effect', 'material', 'supply', 'including' (95.3)
## [2] 'statements', 'annual-report', 'violations', 'forward', 'aspects', 'future-results', 'manner', 'breaches', 'derive', 'evaluate', 'attract', 'important' (3.3)
## [3] 'tax', 'risk-factors', 'operate', 'employees', 'retain', 'suffer', 'employment', 'jurisdictions', 'announced', 'form', 'stockholders', 'expiration' (1.5)
##
## Dispersion = 2.48
A collection of D documents and T term tokens in the corpus is fitted with K topics. The output consists of two probability matrices.
One, a \(\theta\) matrix of term probabilities: for each topic, the probability of each term appearing under that topic. Its dimension is T x K, and each column sums to 1.
Two, an \(\omega\) document-composition matrix: the probability mass distribution of topic proportions in each document. Its dimension is D x K, and each row sums to 1.
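As a quick sanity check on these definitions (a sketch using the fitted simfit object):
colSums(simfit$theta) # each column of theta is a distribution over terms, so these should all be 1
rowSums(simfit$omega)[1:5] # each row of omega is a distribution over topics, so these should all be 1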
E.g., let’s view the term probability matrix \(\theta\):
simfit$theta[1:10,]
## topic
## phrase 1 2 3
## ability 1.007414e-02 3.321229e-03 3.137586e-04
## acceptable-terms 5.506024e-05 4.878380e-03 1.320778e-02
## access 1.939997e-03 3.665013e-04 2.046311e-04
## accordance 3.665511e-04 6.497534e-06 5.436546e-05
## accounting 4.915475e-04 1.252949e-05 1.350030e-04
## accounting-principles 4.003965e-04 9.470456e-06 9.074803e-05
## accounts-receivable 5.337497e-04 2.492922e-06 1.540987e-05
## accurately 3.210940e-04 3.922342e-06 2.597047e-05
## achieve 1.297422e-03 3.453835e-06 1.725366e-05
## acquire 1.060404e-03 9.848305e-06 1.144020e-04
Let's sort this matrix in decreasing order of total term probability and look at the top few terms.
a0 = apply(simfit$theta, 1, sum);
a01 = order(a0, decreasing = TRUE)
simfit$theta[a01[1:10],]
## topic
## phrase 1 2 3
## operate 0.0006837280 1.064828e-04 1.003049e-01
## employees 0.0007553650 8.590988e-05 9.347924e-02
## stockholders 0.0001503744 1.811481e-02 4.749384e-02
## retain 0.0005521713 1.138674e-04 6.020645e-02
## jurisdictions 0.0005690467 1.757184e-04 5.148380e-02
## tax 0.0002912274 1.236657e-04 5.172475e-02
## suffer 0.0004550136 1.154822e-04 4.957057e-02
## financial-reporting 0.0001320058 1.457780e-02 3.099727e-02
## decrease 0.0007128157 1.293427e-04 4.173438e-02
## businesses 0.0010277168 3.985534e-02 5.156678e-05
Here you can see that terms such as operate, employees, retain, and tax have much higher probability under topic 3. Similarly, we can view the \(\omega\) matrix for documents.
simfit$omega[1:10,]
## topic
## document 1 2 3
## 1 0.9990300 0.0004700323 0.0005000104
## 2 0.9952350 0.0028242523 0.0019407306
## 3 0.9984065 0.0005330962 0.0010603684
## 4 0.2345504 0.6674531523 0.0979964715
## 5 0.9889711 0.0030484124 0.0079804934
## 6 0.9542467 0.0315701391 0.0141831304
## 7 0.9488142 0.0375093298 0.0136765063
## 8 0.9501433 0.0342041575 0.0156525868
## 9 0.9449447 0.0293879820 0.0256672807
## 10 0.9669141 0.0113266898 0.0217591919
We can say that documents 1 through 3 load heavily on topic 1, whereas document 4 is a mix: it loads mostly on topic 2 with some weight on topics 1 and 3.
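As a quick sketch (using the simfit object from above), we can also tabulate the dominant topic of every document in one line:
dominant = apply(simfit$omega, 1, which.max) # index of the largest topic proportion per document
table(dominant) # how many documents load mainly on each topic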
Some terms have high frequency, others low, and we want to ensure that raw term frequency does not unduly influence the topic weights. So we normalize term frequency in a metric called 'lift'.
The lift of a term for a topic is its topic-membership probability normalized by its occurrence probability in the corpus. If a term has high lift for a topic, that term is useful in constructing that topic.
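In symbols, for term \(t\) and topic \(k\): \(\text{lift}_{t,k} = \theta_{t,k} / p_t\), where \(p_t = \sum_d X_{d,t} / \sum_{d,t'} X_{d,t'}\) is the marginal probability of term \(t\) occurring in the document-term matrix \(X\).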
Since the topics function doesn't return a lift matrix for terms, we can write a simple function to calculate the lift of each term. Depending on the number of terms in the DocumentTermMatrix, the lift calculation may take some time.
t = Sys.time()
theta = simfit$theta
lift = theta*0; sum1 = sum(dtm1)
for (i in 1:nrow(theta)){
  for (j in 1:ncol(theta)){
    ptermtopic = theta[i, j]       # term i's probability of topic j membership
    pterm = sum(dtm1[,i])/sum1     # marginal probability of term i's occurrence in corpus
    lift[i, j] = ptermtopic/pterm  # lift: topic membership probability normalized by occurrence probability
  }
}
Sys.time() - t # Total time for calculating lift
## Time difference of 4.966544 secs
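The same computation can be vectorized into a single division; here is a sketch, assuming the columns of dtm1 line up with the rownames of theta:
pterm = colSums(as.matrix(dtm1)) / sum(dtm1) # marginal occurrence probability of every term
lift2 = theta / pterm[rownames(theta)] # R recycles pterm down each column, matching the loop above
# max(abs(lift - lift2)) # should be ~0 if the orderings match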
Let's print the lift for the first 10 terms.
lift[1:10,]
## topic
## phrase 1 2 3
## ability 1.0310682 0.339921193 0.03211257
## acceptable-terms 0.1534537 13.596121266 36.81029661
## access 1.0336535 0.195276183 0.10902985
## accordance 1.0375457 0.018391675 0.15388483
## accounting 1.0354277 0.026392937 0.28437915
## accounting-principles 1.0362033 0.024509000 0.23485072
## accounts-receivable 1.0396986 0.004855998 0.03001710
## accurately 1.0387163 0.012688495 0.08401261
## achieve 1.0399816 0.002768508 0.01383010
## acquire 1.0383704 0.009643673 0.11202487
Now we have lift and theta for each term and each topic. We can plot a wordcloud for each topic in which a term is included if its lift is above 1, with size proportional to its term probability. These wordclouds will give us an idea of each Latent Topic. Let's plot the top 100 terms in each topic.
for (i in 1:K){ # For each topic
  a0 = which(lift[,i] > 1)           # terms with lift greater than 1 for topic i
  freq = theta[a0, i]                # theta for those terms
  freq = sort(freq, decreasing = T)  # terms with higher probabilities for topic i
  clean1 <- tm_map(RFCorpus, removeWords, stopwords("english"))  # corpus cleanup (not used by the wordcloud below)
  clean2 <- tm_map(clean1, removeWords, c("protect","fail"))
  # Auto correction - sometimes fewer than 100 terms in a topic have lift above 1
  n = ifelse(length(freq) >= 100, 100, length(freq))
  top_word = as.matrix(freq[1:n])
  # Plot wordcloud
  wordcloud(rownames(top_word), top_word, scale=c(4,0.5), 1,
            random.order=FALSE, random.color=FALSE,
            colors=brewer.pal(8, "Dark2"))
  mtext(paste("Latent Topic", i), side = 3, line = 2, cex = 2)
}
Now, let's label and interpret the wordcloud topics to get a clearer picture of them. Let's also plot a co-occurrence graph of the top 20 terms in each topic.
for (i in 1:K){ # For each topic
  a0 = which(lift[,i] > 1)           # terms with lift greater than 1 for topic i
  freq = theta[a0, i]                # theta for those terms
  freq = sort(freq, decreasing = T)  # terms with higher probabilities for topic i
  # Auto correction - sometimes fewer than 20 terms in a topic have lift above 1
  n = ifelse(length(freq) >= 20, 20, length(freq))
  top_word = as.matrix(freq[1:n])
  # Document Term Matrix restricted to the top n words
  mat = dtm1[, match(row.names(top_word), colnames(dtm1))]
  mat = as.matrix(mat)
  cmat = t(mat) %*% (mat)  # term co-occurrence counts
  diag(cmat) = 0
  # Limit each term to its 2 strongest connections
  for (p in 1:nrow(cmat)){
    vec = cmat[p,]
    cutoff = sort(vec, decreasing = T)[2]
    cmat[p,][cmat[p,] < cutoff] = 0
  }
  # Alternative thresholding: cmat[cmat < quantile(cmat, .80)] = 0
  graph <- graph.adjacency(cmat, mode = "undirected", weighted = T)
  plot(graph,                                 # the graph to be plotted
       layout = layout.fruchterman.reingold,  # the layout method
       vertex.frame.color = 'blue',           # border color of the dots
       vertex.label.color = 'black',          # color of the name labels
       vertex.label.font = 1,                 # font of the name labels
       vertex.size = .00001,                  # dot size
       vertex.label.cex = 1.3)
  mtext(paste("Topic", i), side = 3, line = 2, cex = 2)
}
Now that we have the lift matrix and the DocumentTermMatrix, we can create a weighting scheme for each document and each topic that gives the proportion of each topic in a document. First I define a function, and then I call it to calculate the topic proportions in the documents.
eta = function(mat, dtm) {
  mat1 = mat/mean(mat)                  # rescale the lift matrix
  terms1 = rownames(mat1)
  eta.mat = matrix(0, 1, ncol(mat1))
  for (i in 1:nrow(dtm)){
    a11 = as.data.frame(matrix(dtm[i,]))   # term counts for document i
    rownames(a11) = colnames(dtm)
    a12 = as.matrix(a11[(a11 > 0),])       # keep only terms present in the document
    rownames(a12) = rownames(a11)[(a11 > 0)]
    a13 = intersect(terms1, rownames(a12)) # terms present in both lift matrix and document
    a14a = match(a13, terms1)              # positions of matching terms in mat1 matrix
    a14b = match(a13, rownames(a12))       # positions of matching terms in the document
    a15 = mat1[a14a,] * matrix(rep(a12[a14b,],
                                   ncol(mat1)),
                               ncol = ncol(mat1))  # count-weighted lift per term and topic
    eta.mat = rbind(eta.mat, apply(a15, 2, mean))
    rm(a11, a12, a13, a14a, a14b, a15)
  }
  eta.mat = eta.mat[2:nrow(eta.mat), ] # remove top zeroes row
  row.names(eta.mat) = row.names(dtm)
  return(eta.mat)
}
twc = eta(lift, dtm1)
head(twc)
## 1 2 3
## 1 3.464351 1.781464 1.376340
## 2 2.282775 1.435010 1.103567
## 3 2.804509 1.280848 1.204788
## 4 1.133522 8.142493 5.136392
## 5 3.117328 2.178786 2.325941
## 6 2.200273 2.485371 2.218339
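Note that the rows of twc are mean count-weighted lifts, so they do not sum to 1. As a sketch, we can normalize each row so the values read as topic shares within a document:
twc.prop = twc / rowSums(twc) # each row now sums to 1
round(head(twc.prop), 3)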
Now that we have topic proportions for each document, we can find the top documents loading on each topic and read them for a better interpretation of the topics.
First I define a function which sorts the twc matrix in decreasing order and picks the names of the top n (here n = 5) documents. Then I call this function with the required arguments and print the company names for each topic.
eta.file.name = function(mat, calib, n) {
  s = list() # blank list
  for (i in 1:ncol(mat)){ # for each topic
    read_doc = mat[order(mat[,i], decreasing = T),] # sort document proportion matrix (twc)
    read_names = row.names(read_doc[1:n,])          # document indices for the first n documents
    s[[i]] = calib[as.numeric(read_names), 1]       # store the first n company names in the list
  }
  return(s)
}
temp1 = eta.file.name(twc,textdata,5)
for (i in 1:length(temp1)){
print(paste('Companies loading heavily on topic',i,'are'))
print(temp1[[i]])
print('--------------------------')
}
## [1] "Companies loading heavily on topic 1 are"
## [1] "SUNPOWER CORP" "LEAP WIRELESS INTL INC"
## [3] "TELEPHONE & DATA SYSTEMS INC" "SANDISK CORP"
## [5] "FIRST SOLAR INC"
## [1] "--------------------------"
## [1] "Companies loading heavily on topic 2 are"
## [1] "AUTOMATIC DATA PROCESSING" "TELEPHONE & DATA SYSTEMS INC"
## [3] "IAC/INTERACTIVECORP" "LEVEL 3 COMMUNICATIONS INC"
## [5] "SUNPOWER CORP"
## [1] "--------------------------"
## [1] "Companies loading heavily on topic 3 are"
## [1] "SUNPOWER CORP" "FACEBOOK INC"
## [3] "AUTOMATIC DATA PROCESSING" "LEAP WIRELESS INTL INC"
## [5] "GROUPON INC"
## [1] "--------------------------"
Similarly, we can retrieve the top text documents themselves. Since these documents are very long, I am not printing them here; you can uncomment the code below to print and read them as required.
eta.file = function(mat, calib, n) {
  s = list() # blank list
  for (i in 1:ncol(mat)){ # for each topic
    read_doc = mat[order(mat[,i], decreasing = T),] # sort document proportion matrix (twc)
    read_names = row.names(read_doc[1:n,])          # document indices for the first n documents
    s[[i]] = calib[as.numeric(read_names), 2]       # store the first n documents in the list
  }
  return(s)
}
temp2 = eta.file(twc,textdata,5)
# for (i in 1:length(temp2)){
# print(paste('Documents loading heavily on topic',i,'are'))
# print(temp2[[i]])
# print('--------------------------')
# }
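If you only want a quick look, here is a sketch that previews the first 500 characters of each top document instead of printing it whole:
for (i in 1:length(temp2)){ # For each topic
  cat("Topic", i, "top documents (first 500 characters):\n")
  print(substr(as.character(temp2[[i]]), 1, 500)) # substr works elementwise on the character vector
}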
Reading the top terms and documents together, we can label the three topics:
Topic 1: Risk factors concerning business products (products, customers, suppliers, manufacturing, supply).
Topic 2: Risk factors arising from changes in business strategies (forward-looking statements, future results, breaches).
Topic 3: Risks affecting operations and human capital (employees, retention, employment, tax, jurisdictions).
Overall, the interpretation is that these firms' risk factors center on delivering business products and services to clients and customers.