Group Assignment 3 - Topic Mining

Overview:

We are going to do text mining of SEC 10-K documents business description to find insights and correlations of business description. As we allknow about the background, lets quickly jumpon to analysis part We are analyzing Business description of 85 organizations from Fortue 100

Topic Mining walkthrough clear work space and include necessary R Libraries

options(warn=-1)
rm(list=ls()) # Clear the workspace
library("tm")
library("wordcloud")
library("maptpx")
library("igraph")
library("rJava")
library("textir")
library("RWeka")
library("qdap")

Read the text file and Document term matrix (DTM)

textdata = readRDS(file.choose()) # Select BD.Technology.Rds  data set
dtm_tf = readRDS(file.choose()) # Select dtm1.BD.Rds

the DTM we have is in TF (Term frequency) format. Lets create Inverted Document frequency DTM (TFIDF), use Custom DTM creation function used in class for the same

###########################################################
# Customize Document Term Matrix
custom.dtm  = function(x1,                               # Text Corpus
                       scheme)                           # tf or tfidf
{
  tdm = TermDocumentMatrix(x1)
  a1 = apply(tdm, 1, sum)
  a2 =((a1 >= 2))
  tdm.new = tdm[a2, ]
  # remove blank documents (i.e. columns with zero sums)
  a0 = NULL; 
  for (i1 in 1:ncol(tdm.new)){ if (sum(tdm.new[, i1]) == 0) {a0 = c(a0, i1)} }
  length(a0)    # no. of empty docs in the corpus
  if (length(a0) >0) { tdm.new1 = tdm.new[, -a0]} else {tdm.new1 = tdm.new};
  
  dim(tdm.new1)    # reduced tdm
  if (scheme == "tfidf") {
    x2mat = t(tfidf(tdm.new1))
  }
  else {x2mat = t((tdm.new1))}
  return(x2mat)
}
###########################################################

Now convert the DTM_TF Document matrix back to corpus and recreate it as TFIDF DTM

dtm2list <- apply(dtm_tf, 1, function(x) {
  paste(rep(names(x), x), collapse=" ")
})
doc_corpus <- VCorpus(VectorSource(dtm2list))

Now the DTM is converted back to corpus as doc_corpus, Create TFIDF Document term matrix using word corpus

dtm_tfidf = custom.dtm(doc_corpus,"tfidf")            # Term Frequency Inverse Document Frequency Scheme

As per LDA model we need to map these DTMs to topics,as we dot know the number of topics, lets start with topic as 2 and try our analysis, and then we can repeat for 3, 4 and 5 topics

K = 2 # Choose number of topics in the model
simfit = topics(dtm_tf, K = K, verb = 2) # Fit the K topic model

summary(simfit, nwrd = 25) # Summary of simfit model

## 
## Top 25 phrases by topic-over-null term lift (and usage %):
## 
## [1] 'adhesion', 'advantedge', 'appliedmaterials', 'baccini', 'centura', 'chipmaking', 'dpn', 'duv', 'endura', 'epitaxy', 'equipment-services', 'eterna', 'gates', 'hct', 'ions', 'ltps', 'microscopes', 'opc', 'pappis', 'pegaso', 'reflexion', 'sem', 'served-years', 'sige', 'sunfab' (76.3) 
## [2] 'employee', 'employment', 'dealer-services', 'employers', 'insurance', 'employer-services', 'administration', 'hcm', 'talent', 'adp-totalsource', 'human-capital', 'digital-marketing', 'dms', 'employer', 'payment', 'retirement', 'dealer', 'administration-services', 'peo', 'wage', 'attendance', 'tax-credits', 'benefits-administration', 'services-adp', 'worksite-employees' (23.7) 
## 
## Dispersion = 2.93

we can see the top 25 words in each topic with Dispersion of 2.93

lets do document matrix analysis of these topics, D is the number of documents T is Term-tokens in corpus K is the no of topics

So lets fid out the probability of tokens(T) belonging to respective Topics (K),for this we need to build T X K matrix

Display Tokens X Topics matrix with decreasing order of total term probability

a0 = apply(simfit$theta, 1, sum);
a01 = order(a0, decreasing = TRUE)
simfit$theta[a01[1:10],]

##                 topic
## phrase                     1            2
##   products       0.009976802 1.488795e-07
##   customers      0.008570852 1.458728e-07
##   services       0.008381531 3.307089e-07
##   business       0.005006281 5.383343e-07
##   provide        0.004555747 2.466252e-07
##   solutions      0.004537316 1.538186e-06
##   company        0.004461091 3.236113e-07
##   technology     0.004199792 1.198603e-07
##   vice-president 0.004145720 1.109598e-07
##   systems        0.003902374 1.373013e-07

Similarly lets display D X K - Documents by topics matrix to understand how documets are assigned to topics.Displat this in decreasing order of probability

a0 = apply(simfit$omega, 1, sum);
a01 = order(a0, decreasing = TRUE)
simfit$omega[a01[1:10],]

##         topic
## document         1            2
##       78 0.8495473 0.1504526663
##       1  0.9998939 0.0001061231
##       4  0.7769296 0.2230703517
##       5  0.9998869 0.0001130663
##       6  0.8580421 0.1419578690
##       7  0.7570766 0.2429233754
##       8  0.6980697 0.3019303131
##       9  0.8249611 0.1750389206
##       10 0.7720353 0.2279647146
##       11 0.8203302 0.1796698181

InterpretationAs you can see the words which take dominance are products, customers, services,solutions etc.and term distribution between document 1 and 2 is uneven. Lets ensure that term frequency does not unduly influence topic weights. So we normalize term frequency in a metric called ‘lift’.

Calculating lift time for avoiding undue influence of few tokens over topics

t = Sys.time()
theta = simfit$theta
lift = theta*0; sum1 = sum(dtm_tf)
for (i in 1:nrow(theta)){
for (j in 1:ncol(theta)){
ptermtopic = 0; pterm = 0;
ptermtopic = theta[i, j] # term i's probability of topic j membership
pterm = sum(dtm_tf[,i])/sum1 # marginal probability of term i's occurrence in corpus
lift[i, j] = ptermtopic/pterm # so, lift is topic membership probability normalized by occurrence }
}
}
Sys.time() - t # Total time for calculating lift

## Time difference of 2.277344 mins

for (i in 1:K){ # For each topic
a0 = which(lift[,i] > 1) # terms with lift greator than 1 for topic i
freq = theta[a0,i] # Theta for terms greator than 1
freq = sort(freq,decreasing = T) # Terms with higher probilities for topic i
# Auto Correction - Sometime terms in topic with lift above 1 are less than 100. So auto correction
n = ifelse(length(freq) >= 100, 100, length(freq))
top_word = as.matrix(freq[1:n])
# Plot wordcloud
wordcloud(rownames(top_word), top_word, scale=c(3,0.5), 1,
random.order=FALSE, random.color=FALSE,
colors=brewer.pal(8, "Dark2"))
mtext(paste("Latent Topic",i), side = 2, line = 2, cex=1)
}

for (i in 1:K){ # For each topic
a0 = which(lift[,i] > 1) # terms with lift greator than 1 for topic i
freq = theta[a0,i] # Theta for terms greator than 1
freq = sort(freq,decreasing = T) # Terms with higher probilities for topic i
# Auto Correction - Sometime terms in topic with lift above 1 are less than 30. So auto correction
n = ifelse(length(freq) >= 20, 20, length(freq))
top_word = as.matrix(freq[1:n])
# now for top 30 words let's find Document Term Matrix
mat = dtm_tf[,match(row.names(top_word),colnames(dtm_tf))]
mat = as.matrix(mat)
cmat = t(mat) %*% (mat)
diag(cmat) = 0
# Let's limit number of connections to 2
for (p in 1:nrow(cmat)){
vec = cmat[p,]
cutoff = sort(vec, decreasing = T)[2]
cmat[p,][cmat[p,] < cutoff] = 0
}
#cmat[cmat < quantile(cmat,.80)] = 0
graph <- graph.adjacency(cmat, mode = "undirected",weighted=T)
plot(graph, #the graph to be plotted
layout=layout.fruchterman.reingold, # the layout method.
vertex.frame.color='blue', #the color of the border of the dots
vertex.label.color='black', #the color of the name labels
vertex.label.font=1, #the font of the name labels
vertex.size = .00001, # Dots size
vertex.label.cex=1.3)
mtext(paste("Topic",i), side = 3, line = 1, cex=1)
}

Interpretation Now we can see 1 word cloud and co-occurange graph for each topic. so what does these infer

Topic 1:

Title products and services are the key drivers of companies

Interpretation: This topic clearly splitted the nature of company’s business as a seperate topic. Predomantly used words are products, services and customers, the co-occurance words for these words are products (solutions, applications and systems), Services(operations, business, provide), customers(market,data, technology). This shows what are the key drivers of companies which they report in filings as company’s description.

Topic 2::

Title Technology powers the strength of Business

Interpretation: This topic carries the company names and specific technology strengths of the company. The most widely used words are Online, Adobe,manage and cellular and these have co-occurance with adobe(online, analytics,search), manage (teradata, payments, telecom) and cellular(telecom, payments). This shows that technology is key driver when companies claim about their strengths

Calculate weghtage scheme & Topic proportion to identify companies heavily on each topic

eta = function(mat, dtm) {
mat1 = mat/mean(mat); terms1 = rownames(mat1);
eta.mat = matrix(0, 1, ncol(mat1))
for (i in 1:nrow(dtm)){
a11 = as.data.frame(matrix(dtm[i,]));
rownames(a11) = colnames(dtm)
a12 = as.matrix(a11[(a11>0),]);
rownames(a12) = rownames(a11)[(a11>0)];
rownames(a12)[1:4]
a13 = intersect(terms1, rownames(a12));
a13[1:15]; length(a13)
a14a = match(a13, terms1); # positions of matching terms in mat1 matrix
a14b = match(a13, rownames(a12))
a15 = mat1[a14a,]*matrix(rep(a12[a14b,],
ncol(mat1)),
ncol = ncol(mat1))
eta.mat = rbind(eta.mat, apply(a15, 2, mean))
rm(a11, a12, a13, a14a, a14b, a15)
}
eta.mat = eta.mat[2:nrow(eta.mat), ] # remove top zeroes row
8
row.names(eta.mat)=row.names(dtm)
return(eta.mat)
}
twc = eta(lift, dtm_tf)
head(twc)

##          1           2
## 1 2.386493 0.012333214
## 2 1.469910 0.588225140
## 3 2.462572 0.004228185
## 4 1.626116 1.487447775
## 5 2.220313 0.004720855
## 6 1.852733 1.048145495

Identify top 3 companies in each Topic

eta.file.name = function(mat,calib,n) {
s = list() # Blank List
for (i in 1: ncol(mat)) # For each topic
{
read_doc = mat[order(mat[,i], decreasing= T),] # Sort document prop matrix (twc)
read_names = row.names(read_doc[1:n,]) # docuemnt index for first n document
s[[i]] = calib[as.numeric(read_names),1] # Store first n companies name in list
}
return(s)
}
temp1 = eta.file.name(twc,textdata,5)
for (i in 1:length(temp1)){
print(paste('Companies loading heavily on topic',i,'are'))
print(temp1[[i]])
print('--------------------------')
}

## [1] "Companies loading heavily on topic 1 are"
## [1] "ADOBE SYSTEMS INC"            "APPLIED MATERIALS INC"       
## [3] "ADVANCED MICRO DEVICES"       "TELEPHONE & DATA SYSTEMS INC"
## [5] "CENTURYLINK INC"             
## [1] "--------------------------"
## [1] "Companies loading heavily on topic 2 are"
## [1] "ADOBE SYSTEMS INC"            "DST SYSTEMS INC"             
## [3] "ORACLE CORP"                  "BROADRIDGE FINANCIAL SOLUTNS"
## [5] "TRIMBLE NAVIGATION LTD"      
## [1] "--------------------------"

Reaffirm the Interpretation based on Top 3 companies for each Topic

Topic 1:

Title products and services are the key drivers of companies

The main companies in this topic are Adobe, AMD and Century link,these 3 companies have unique products/services which drive their business.

Topic 2::

Title Technology powers the strength of Business

The main companies in this topic are Oracle and Timberlink with Adove repeating in this topic aswell, as interpreted,these companies along with products/services, Technology is main driver for their success and marketshare.

Industry Applicability

The Topic mining on Company documents is an important source of information to understand the background of companies on a trend. It heps identifying outliers on specific trend. In industries like Stock trading, it is not humanly possible to study the huge volume of documents, Topic mining helps in identifying outliers, on whom specific market trend can be studied toprovide market recommendations

———Thank You———