We are going to do text mining of SEC 10-K documents business description to find insights and correlations of business description. As we allknow about the background, lets quickly jumpon to analysis part We are analyzing Business description of 85 organizations from Fortue 100
Topic Mining walkthrough clear work space and include necessary R Libraries
options(warn=-1)
rm(list=ls()) # Clear the workspace
library("tm")
library("wordcloud")
library("maptpx")
library("igraph")
library("rJava")
library("textir")
library("RWeka")
library("qdap")
Read the text file and Document term matrix (DTM)
textdata = readRDS(file.choose()) # Select BD.Technology.Rds data set
dtm_tf = readRDS(file.choose()) # Select dtm1.BD.Rds
the DTM we have is in TF (Term frequency) format. Lets create Inverted Document frequency DTM (TFIDF), use Custom DTM creation function used in class for the same
###########################################################
# Customize Document Term Matrix
custom.dtm = function(x1, # Text Corpus
scheme) # tf or tfidf
{
tdm = TermDocumentMatrix(x1)
a1 = apply(tdm, 1, sum)
a2 =((a1 >= 2))
tdm.new = tdm[a2, ]
# remove blank documents (i.e. columns with zero sums)
a0 = NULL;
for (i1 in 1:ncol(tdm.new)){ if (sum(tdm.new[, i1]) == 0) {a0 = c(a0, i1)} }
length(a0) # no. of empty docs in the corpus
if (length(a0) >0) { tdm.new1 = tdm.new[, -a0]} else {tdm.new1 = tdm.new};
dim(tdm.new1) # reduced tdm
if (scheme == "tfidf") {
x2mat = t(tfidf(tdm.new1))
}
else {x2mat = t((tdm.new1))}
return(x2mat)
}
###########################################################
Now convert the DTM_TF Document matrix back to corpus and recreate it as TFIDF DTM
dtm2list <- apply(dtm_tf, 1, function(x) {
paste(rep(names(x), x), collapse=" ")
})
doc_corpus <- VCorpus(VectorSource(dtm2list))
Now the DTM is converted back to corpus as doc_corpus, Create TFIDF Document term matrix using word corpus
dtm_tfidf = custom.dtm(doc_corpus,"tfidf") # Term Frequency Inverse Document Frequency Scheme
As per LDA model we need to map these DTMs to topics,as we dot know the number of topics, lets start with topic as 2 and try our analysis, and then we can repeat for 3, 4 and 5 topics
K = 2 # Choose number of topics in the model
simfit = topics(dtm_tf, K = K, verb = 2) # Fit the K topic model
summary(simfit, nwrd = 25) # Summary of simfit model
##
## Top 25 phrases by topic-over-null term lift (and usage %):
##
## [1] 'adhesion', 'advantedge', 'appliedmaterials', 'baccini', 'centura', 'chipmaking', 'dpn', 'duv', 'endura', 'epitaxy', 'equipment-services', 'eterna', 'gates', 'hct', 'ions', 'ltps', 'microscopes', 'opc', 'pappis', 'pegaso', 'reflexion', 'sem', 'served-years', 'sige', 'sunfab' (76.3)
## [2] 'employee', 'employment', 'dealer-services', 'employers', 'insurance', 'employer-services', 'administration', 'hcm', 'talent', 'adp-totalsource', 'human-capital', 'digital-marketing', 'dms', 'employer', 'payment', 'retirement', 'dealer', 'administration-services', 'peo', 'wage', 'attendance', 'tax-credits', 'benefits-administration', 'services-adp', 'worksite-employees' (23.7)
##
## Dispersion = 2.93
we can see the top 25 words in each topic with Dispersion of 2.93
lets do document matrix analysis of these topics, D is the number of documents T is Term-tokens in corpus K is the no of topics
So lets fid out the probability of tokens(T) belonging to respective Topics (K),for this we need to build T X K matrix
Display Tokens X Topics matrix with decreasing order of total term probability
a0 = apply(simfit$theta, 1, sum);
a01 = order(a0, decreasing = TRUE)
simfit$theta[a01[1:10],]
## topic
## phrase 1 2
## products 0.009976802 1.488795e-07
## customers 0.008570852 1.458728e-07
## services 0.008381531 3.307089e-07
## business 0.005006281 5.383343e-07
## provide 0.004555747 2.466252e-07
## solutions 0.004537316 1.538186e-06
## company 0.004461091 3.236113e-07
## technology 0.004199792 1.198603e-07
## vice-president 0.004145720 1.109598e-07
## systems 0.003902374 1.373013e-07
Similarly lets display D X K - Documents by topics matrix to understand how documets are assigned to topics.Displat this in decreasing order of probability
a0 = apply(simfit$omega, 1, sum);
a01 = order(a0, decreasing = TRUE)
simfit$omega[a01[1:10],]
## topic
## document 1 2
## 78 0.8495473 0.1504526663
## 1 0.9998939 0.0001061231
## 4 0.7769296 0.2230703517
## 5 0.9998869 0.0001130663
## 6 0.8580421 0.1419578690
## 7 0.7570766 0.2429233754
## 8 0.6980697 0.3019303131
## 9 0.8249611 0.1750389206
## 10 0.7720353 0.2279647146
## 11 0.8203302 0.1796698181
InterpretationAs you can see the words which take dominance are products, customers, services,solutions etc.and term distribution between document 1 and 2 is uneven. Lets ensure that term frequency does not unduly influence topic weights. So we normalize term frequency in a metric called ‘lift’.
Calculating lift time for avoiding undue influence of few tokens over topics
t = Sys.time()
theta = simfit$theta
lift = theta*0; sum1 = sum(dtm_tf)
for (i in 1:nrow(theta)){
for (j in 1:ncol(theta)){
ptermtopic = 0; pterm = 0;
ptermtopic = theta[i, j] # term i's probability of topic j membership
pterm = sum(dtm_tf[,i])/sum1 # marginal probability of term i's occurrence in corpus
lift[i, j] = ptermtopic/pterm # so, lift is topic membership probability normalized by occurrence }
}
}
Sys.time() - t # Total time for calculating lift
## Time difference of 2.277344 mins
for (i in 1:K){ # For each topic
a0 = which(lift[,i] > 1) # terms with lift greator than 1 for topic i
freq = theta[a0,i] # Theta for terms greator than 1
freq = sort(freq,decreasing = T) # Terms with higher probilities for topic i
# Auto Correction - Sometime terms in topic with lift above 1 are less than 100. So auto correction
n = ifelse(length(freq) >= 100, 100, length(freq))
top_word = as.matrix(freq[1:n])
# Plot wordcloud
wordcloud(rownames(top_word), top_word, scale=c(3,0.5), 1,
random.order=FALSE, random.color=FALSE,
colors=brewer.pal(8, "Dark2"))
mtext(paste("Latent Topic",i), side = 2, line = 2, cex=1)
}
for (i in 1:K){ # For each topic
a0 = which(lift[,i] > 1) # terms with lift greator than 1 for topic i
freq = theta[a0,i] # Theta for terms greator than 1
freq = sort(freq,decreasing = T) # Terms with higher probilities for topic i
# Auto Correction - Sometime terms in topic with lift above 1 are less than 30. So auto correction
n = ifelse(length(freq) >= 20, 20, length(freq))
top_word = as.matrix(freq[1:n])
# now for top 30 words let's find Document Term Matrix
mat = dtm_tf[,match(row.names(top_word),colnames(dtm_tf))]
mat = as.matrix(mat)
cmat = t(mat) %*% (mat)
diag(cmat) = 0
# Let's limit number of connections to 2
for (p in 1:nrow(cmat)){
vec = cmat[p,]
cutoff = sort(vec, decreasing = T)[2]
cmat[p,][cmat[p,] < cutoff] = 0
}
#cmat[cmat < quantile(cmat,.80)] = 0
graph <- graph.adjacency(cmat, mode = "undirected",weighted=T)
plot(graph, #the graph to be plotted
layout=layout.fruchterman.reingold, # the layout method.
vertex.frame.color='blue', #the color of the border of the dots
vertex.label.color='black', #the color of the name labels
vertex.label.font=1, #the font of the name labels
vertex.size = .00001, # Dots size
vertex.label.cex=1.3)
mtext(paste("Topic",i), side = 3, line = 1, cex=1)
}
Interpretation Now we can see 1 word cloud and co-occurange graph for each topic. so what does these infer
Topic 1:
Title products and services are the key drivers of companies
Interpretation: This topic clearly splitted the nature of company’s business as a seperate topic. Predomantly used words are products, services and customers, the co-occurance words for these words are products (solutions, applications and systems), Services(operations, business, provide), customers(market,data, technology). This shows what are the key drivers of companies which they report in filings as company’s description.
Topic 2::
Title Technology powers the strength of Business
Interpretation: This topic carries the company names and specific technology strengths of the company. The most widely used words are Online, Adobe,manage and cellular and these have co-occurance with adobe(online, analytics,search), manage (teradata, payments, telecom) and cellular(telecom, payments). This shows that technology is key driver when companies claim about their strengths
eta = function(mat, dtm) {
mat1 = mat/mean(mat); terms1 = rownames(mat1);
eta.mat = matrix(0, 1, ncol(mat1))
for (i in 1:nrow(dtm)){
a11 = as.data.frame(matrix(dtm[i,]));
rownames(a11) = colnames(dtm)
a12 = as.matrix(a11[(a11>0),]);
rownames(a12) = rownames(a11)[(a11>0)];
rownames(a12)[1:4]
a13 = intersect(terms1, rownames(a12));
a13[1:15]; length(a13)
a14a = match(a13, terms1); # positions of matching terms in mat1 matrix
a14b = match(a13, rownames(a12))
a15 = mat1[a14a,]*matrix(rep(a12[a14b,],
ncol(mat1)),
ncol = ncol(mat1))
eta.mat = rbind(eta.mat, apply(a15, 2, mean))
rm(a11, a12, a13, a14a, a14b, a15)
}
eta.mat = eta.mat[2:nrow(eta.mat), ] # remove top zeroes row
8
row.names(eta.mat)=row.names(dtm)
return(eta.mat)
}
twc = eta(lift, dtm_tf)
head(twc)
## 1 2
## 1 2.386493 0.012333214
## 2 1.469910 0.588225140
## 3 2.462572 0.004228185
## 4 1.626116 1.487447775
## 5 2.220313 0.004720855
## 6 1.852733 1.048145495
eta.file.name = function(mat,calib,n) {
s = list() # Blank List
for (i in 1: ncol(mat)) # For each topic
{
read_doc = mat[order(mat[,i], decreasing= T),] # Sort document prop matrix (twc)
read_names = row.names(read_doc[1:n,]) # docuemnt index for first n document
s[[i]] = calib[as.numeric(read_names),1] # Store first n companies name in list
}
return(s)
}
temp1 = eta.file.name(twc,textdata,5)
for (i in 1:length(temp1)){
print(paste('Companies loading heavily on topic',i,'are'))
print(temp1[[i]])
print('--------------------------')
}
## [1] "Companies loading heavily on topic 1 are"
## [1] "ADOBE SYSTEMS INC" "APPLIED MATERIALS INC"
## [3] "ADVANCED MICRO DEVICES" "TELEPHONE & DATA SYSTEMS INC"
## [5] "CENTURYLINK INC"
## [1] "--------------------------"
## [1] "Companies loading heavily on topic 2 are"
## [1] "ADOBE SYSTEMS INC" "DST SYSTEMS INC"
## [3] "ORACLE CORP" "BROADRIDGE FINANCIAL SOLUTNS"
## [5] "TRIMBLE NAVIGATION LTD"
## [1] "--------------------------"
Topic 1:
Title products and services are the key drivers of companies
Interpretation: This topic clearly splitted the nature of company’s business as a seperate topic. Predomantly used words are products, services and customers, the co-occurance words for these words are products (solutions, applications and systems), Services(operations, business, provide), customers(market,data, technology). This shows what are the key drivers of companies which they report in filings as company’s description.
The main companies in this topic are Adobe, AMD and Century link,these 3 companies have unique products/services which drive their business.
Topic 2::
Title Technology powers the strength of Business
Interpretation: This topic carries the company names and specific technology strengths of the company. The most widely used words are Online, Adobe,manage and cellular and these have co-occurance with adobe(online, analytics,search), manage (teradata, payments, telecom) and cellular(telecom, payments). This shows that technology is key driver when companies claim about their strengths
The main companies in this topic are Oracle and Timberlink with Adove repeating in this topic aswell, as interpreted,these companies along with products/services, Technology is main driver for their success and marketshare.
The Topic mining on Company documents is an important source of information to understand the background of companies on a trend. It heps identifying outliers on specific trend. In industries like Stock trading, it is not humanly possible to study the huge volume of documents, Topic mining helps in identifying outliers, on whom specific market trend can be studied toprovide market recommendations
———Thank You———