This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both the content and the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
Clear the workspace and load the required libraries
rm(list=ls()) # Clear the workspace
library("tm")
## Warning: package 'tm' was built under R version 3.2.3
## Loading required package: NLP
## Warning: package 'NLP' was built under R version 3.2.3
library("wordcloud")
## Warning: package 'wordcloud' was built under R version 3.2.3
## Loading required package: RColorBrewer
## Warning: package 'RColorBrewer' was built under R version 3.2.3
library("maptpx")
## Warning: package 'maptpx' was built under R version 3.2.3
## Loading required package: slam
## Warning: package 'slam' was built under R version 3.2.3
library("igraph")
## Warning: package 'igraph' was built under R version 3.2.3
##
## Attaching package: 'igraph'
## The following object is masked from 'package:maptpx':
##
## normalize
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
library("ggplot2")
## Warning: package 'ggplot2' was built under R version 3.2.3
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library("reshape")
## Warning: package 'reshape' was built under R version 3.2.4
library("DT")
## Warning: package 'DT' was built under R version 3.2.4
##
## Attaching package: 'DT'
## The following object is masked from 'package:igraph':
##
## %>%
Read the Item 1A (Risk Factors) data that has been scraped and provided in ‘RF.Technology.Rds’
textdata = readRDS(file.choose()) # Select RF.Technology.Rds data set
textdata is a data frame with 2 columns: the first column is the company name and the second column is the text of Item 1A (Risk Factors) from the company’s Form 10-K.
Read the Document Term Matrix (DTM), which has been created from the second column of textdata.
dtm1 = readRDS(file.choose()) # Select dtm1.RF.Rds
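For reference, here is a rough sketch of how a basic unigram DTM could be built from textdata with tm. The provided dtm1 also contains multi-word phrases, so the actual preprocessing behind dtm1.RF.Rds was richer than this; treat the following as illustrative only.
corpus0 = VCorpus(VectorSource(textdata[, 2])) # one document per Risk Factors section
corpus0 = tm_map(corpus0, content_transformer(tolower)) # lower-case everything
corpus0 = tm_map(corpus0, removePunctuation) # drop punctuation
corpus0 = tm_map(corpus0, removeNumbers) # drop numbers
corpus0 = tm_map(corpus0, removeWords, stopwords("english")) # drop common English stopwords
corpus0 = tm_map(corpus0, stripWhitespace) # collapse repeated whitespace
dtm.sketch = DocumentTermMatrix(corpus0) # documents x terms count matrix (illustrative only)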
The LDA model requires us to specify how many latent topics (K) we think exist in the corpus. As we do not know the number of latent topics in advance, we fit models with K = 2, 3, 4 and 5 and see which K yields the ‘best’, that is the most interpretable and clear, story.
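Note that, per the maptpx documentation, topics() can also be given a vector of K values, in which case it compares them via approximate log Bayes factors; a sketch (not run here, and the field name is taken from that documentation):
# simfit.path = topics(dtm1, K = 2:5, verb = 1) # fits each K and keeps the highest-Bayes-factor model
# simfit.path$BF # log Bayes factor for each K on the path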
K = 2 # Choose number of topics in the model
simfit = topics(dtm1, K = K, verb = 2) # Fit the K topic model
##
## Estimating on a 85 document collection.
## Fitting the 2 topic model.
## log posterior increase: 111.9, done. (L = -2347579.3)
Display the summary of top 12 phrases
summary(simfit, nwrd = 12) # Summary of simfit model
##
## Top 12 phrases by topic-over-null term lift (and usage %):
##
## [1] 'abandon', 'atic', 'dominance', 'fastest', 'instruction', 'pack', 'sunnyvale', 'amk', 'applied-products', 'linewidths', 'varian', 'dense' (80.2)
## [2] 'peo', 'adp', 'motivated', 'employer', 'multitude', 'transmitter', 'statutes', 'withheld', 'worsening', 'highest', 'security-systems', 'borrowers' (19.8)
##
## Dispersion = 2.12
Let us examine the outputs of the topic model. Suppose there are D documents and T terms in the corpus, and we are fitting K topics. First, there is a theta matrix of term probabilities: for each topic it gives the probability of each term under that topic, so each column is a probability distribution over terms. Its dimension is T x K.
simfit$theta[1:10,]
## topic
## phrase 1 2
## abandon 8.417557e-06 8.040463e-10
## abandoned 3.785483e-05 9.538075e-08
## abandonment 4.835164e-07 3.196205e-05
## abatement 1.297934e-06 4.563540e-05
## abide 4.718840e-07 3.200891e-05
## abilities 2.886747e-06 1.579139e-04
## ability 7.447915e-03 5.827765e-06
## abroad 1.851574e-04 1.003033e-07
## abruptly 9.114345e-07 3.023824e-05
## absence 2.351180e-06 1.431172e-04
theta <- simfit$theta[1:10,] # keep the first 10 rows for quick inspection (theta is redefined from the full matrix further below)
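As a quick sanity check, each column of theta should be a probability distribution over terms (assuming maptpx normalizes theta by column, which its output suggests), so the column sums should be close to 1:
colSums(simfit$theta) # each topic's term probabilities should sum to roughly 1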
Let’s sort the theta matrix in decreasing order of total term probability and inspect the top terms.
a0 = apply(simfit$theta, 1, sum);
a01 = order(a0, decreasing = TRUE)
# simfit$theta[a01[1:10],]
datatable(simfit$theta[a01[],], class='compact')
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## http://rstudio.github.io/DT/server.html
Second, there is an omega document-composition matrix, which gives the probability distribution of topic proportions within each document, so each row sums to 1. Its dimension is D x K.
simfit$omega[1:10,]
## topic
## document 1 2
## 1 0.9998552 0.0001448060
## 2 0.9995329 0.0004670551
## 3 0.9998276 0.0001724272
## 4 0.5186917 0.4813083067
## 5 0.9997058 0.0002941557
## 6 0.8072293 0.1927706932
## 7 0.7768235 0.2231765152
## 8 0.7928529 0.2071471166
## 9 0.8002837 0.1997162711
## 10 0.7840141 0.2159859383
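These proportions are easier to digest visually; here is a sketch using the already-loaded reshape and ggplot2 packages (the column names produced by melt are renamed explicitly to be safe):
omega_long = melt(simfit$omega) # long format: one row per document-topic pair
colnames(omega_long) = c("document", "topic", "proportion")
ggplot(omega_long, aes(x = factor(document), y = proportion, fill = factor(topic))) +
geom_bar(stat = "identity") +
labs(x = "Document", y = "Topic proportion", fill = "Topic")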
We see that Documents 1 to 10 load mostly on Topic 1, while Document 4 is a more even mix of Topic 1 and Topic 2. Some terms have high frequency and others low frequency, and we want to ensure that raw term frequency does not unduly influence the topic weights. So we normalize term frequency in a metric called ‘lift’: the lift of a term for a topic is its topic-membership probability divided by its marginal occurrence probability in the corpus, i.e. lift(i, j) = theta[i, j] / p(term i), where p(term i) = (count of term i) / (total count of all terms). If the lift of a term for a topic is high, then that term is useful in characterizing that topic. Since the topics function does not return a lift matrix, we write a simple function to calculate the lift of each term. Depending on the number of terms in the DocumentTermMatrix, this calculation may take some time.
t = Sys.time()
theta = simfit$theta
lift = theta*0; sum1 = sum(dtm1)
for (i in 1:nrow(theta)){
for (j in 1:ncol(theta)){
ptermtopic = 0; pterm = 0;
ptermtopic = theta[i, j] # term i's probability of topic j membership
pterm = sum(dtm1[,i])/sum1 # marginal probability of term i's occurrence in corpus
lift[i, j] = ptermtopic/pterm # so, lift is topic membership probability normalized by occurrence probability
}
}
Sys.time() - t # Total time for calculating lift
## Time difference of 1.806305 mins
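The nested loop above is slow (about two minutes here). The same lift matrix can be computed in one vectorized step; a sketch, assuming dtm1 is the slam simple triplet matrix produced by tm and that its column names match the row names of theta:
pterm = slam::col_sums(dtm1)/sum(dtm1) # marginal occurrence probability of every term
lift_fast = theta/pterm[rownames(theta)] # divide each row of theta by that term's marginal probability
# all.equal(lift, lift_fast) # should agree with the loop version above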
Let’s display the lift matrix (the commented-out line would show just the first 10 terms).
# lift[1:10,]
datatable(lift[], class='compact')
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## http://rstudio.github.io/DT/server.html
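Rather than rendering the full lift matrix in a datatable, a lighter-weight sketch is to print the highest-lift terms for each topic:
for (j in 1:K){ # for each topic
top = sort(lift[, j], decreasing = TRUE)[1:10] # ten terms with the highest lift in topic j
cat("Topic", j, ":", paste(names(top), collapse = ", "), "\n")
}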
Now we have lift and theta for each term and each topic. We can plot a wordcloud for each topic in which a term is included if its lift is above 1, with size proportional to its term probability. These wordclouds give us an idea of each latent topic. Let’s plot the top 100 terms in each topic.
for (i in 1:K){ # For each topic
a0 = which(lift[,i] > 1) # terms with lift greater than 1 for topic i
freq = theta[a0,i] # theta for those terms
freq = sort(freq,decreasing = T) # terms with higher probabilities first for topic i
# Auto correction - sometimes fewer than 100 terms in a topic have lift above 1
n = ifelse(length(freq) >= 100, 100, length(freq))
top_word = as.matrix(freq[1:n])
# Plot wordcloud
wordcloud(rownames(top_word), top_word, scale=c(4,0.5), 1,
random.order=FALSE, random.color=FALSE,
colors=brewer.pal(8, "Dark2"))
mtext(paste("Latent Topic",i), side = 3, line = 2, cex=2)
}
From these wordclouds we can label and interpret the topics. Topic 1 suggests risk factors that have to do with business, products, results, customers, operations, etc. Topic 2 highlights TDS, change, business reputation, etc.
To get a clearer picture of the topics, let’s plot a co-occurrence graph of the top 20 terms in each topic. Once the co-occurrence matrix is constructed, we can plot it using the graph.adjacency function from the igraph package. Note that for better readability we censor this matrix so that each term keeps only its top 2 edges.
for (i in 1:K){ # For each topic
a0 = which(lift[,i] > 1) # terms with lift greater than 1 for topic i
freq = theta[a0,i] # theta for those terms
freq = sort(freq,decreasing = T) # terms with higher probabilities first for topic i
# Auto correction - sometimes fewer than 20 terms in a topic have lift above 1
n = ifelse(length(freq) >= 20, 20, length(freq))
top_word = as.matrix(freq[1:n])
# now restrict the Document Term Matrix to these top terms
mat = dtm1[,match(row.names(top_word),colnames(dtm1))]
mat = as.matrix(mat)
cmat = t(mat) %*% (mat)
diag(cmat) = 0
# Let's limit number of connections to 2
for (p in 1:nrow(cmat)){
vec = cmat[p,]
cutoff = sort(vec, decreasing = T)[2]
cmat[p,][cmat[p,] < cutoff] = 0
}
#cmat[cmat < quantile(cmat,.80)] = 0
graph <- graph.adjacency(cmat, mode = "undirected",weighted=T)
plot(graph, #the graph to be plotted
layout=layout.fruchterman.reingold, # the layout method.
vertex.frame.color='blue', #the color of the border of the dots
vertex.label.color='black', #the color of the name labels
vertex.label.font=1, #the font of the name labels
vertex.size = .00001, # Dots size
vertex.label.cex=1.3)
mtext(paste("Topic",i), side = 3, line = 2, cex=2)
}
Now that we have the lift matrix and the DocumentTermMatrix, we can create a weighting scheme for each document and each topic, which gives the weight of each topic in each document. Here is a function (eta) to calculate these topic weights for the documents.
eta = function(mat, dtm) {
mat1 = mat/mean(mat); terms1 = rownames(mat1);
eta.mat = matrix(0, 1, ncol(mat1))
for (i in 1:nrow(dtm)){
a11 = as.data.frame(matrix(dtm[i,])); # term counts for document i
rownames(a11) = colnames(dtm)
a12 = as.matrix(a11[(a11>0),]); # keep only terms that occur in document i
rownames(a12) = rownames(a11)[(a11>0)];
a13 = intersect(terms1, rownames(a12)); # terms present in both the lift matrix and document i
a14a = match(a13, terms1); # positions of matching terms in mat1 matrix
a14b = match(a13, rownames(a12)) # positions of matching terms in document i's counts
a15 = mat1[a14a,]*matrix(rep(a12[a14b,], ncol(mat1)), ncol = ncol(mat1)) # scale each term's normalized lift by its count in document i
eta.mat = rbind(eta.mat, apply(a15, 2, mean))
rm(a11, a12, a13, a14a, a14b, a15)
}
eta.mat = eta.mat[2:nrow(eta.mat), ] # remove top zeroes row
row.names(eta.mat)=row.names(dtm)
return(eta.mat)
}
twc = eta(lift, dtm1)
head(twc)
## 1 2
## 1 1.8058129 0.07338217
## 2 1.3752730 0.12560819
## 3 1.7365399 0.05202135
## 4 0.6121672 1.86188229
## 5 1.6841064 0.13873854
## 6 1.2361259 1.18955013
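These eta weights are relative rather than normalized; to read them as shares that sum to 1 within each document, one could rescale each row (a quick sketch):
twc_prop = twc/rowSums(twc) # rescale so each document's topic weights sum to 1
round(head(twc_prop), 3)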
Now that we have each topic’s weight in every document, we can find the top documents loading on a topic and read them for a better interpretation of the topics. Here is a function which sorts the twc matrix in decreasing order for each topic and picks the names of the top n (here n = 5) documents. The function is then called with the required arguments and the company names for each topic are printed.
eta.file.name = function(mat,calib,n) {
s = list() # Blank List
for (i in 1: ncol(mat)) # For each topic
{
read_doc = mat[order(mat[,i], decreasing= T),] # Sort document prop matrix (twc)
read_names = row.names(read_doc[1:n,]) # document indices of the first n documents
s[[i]] = calib[as.numeric(read_names),1] # Store first n companies name in list
}
return(s)
}
temp1 = eta.file.name(twc,textdata,5)
for (i in 1:length(temp1)){
print(paste('Companies loading heavily on topic',i,'are'))
print(temp1[[i]])
print('--------------------------')
}
## [1] "Companies loading heavily on topic 1 are"
## [1] "SUNPOWER CORP" "ADVANCED MICRO DEVICES"
## [3] "SANDISK CORP" "APPLIED MATERIALS INC"
## [5] "LEAP WIRELESS INTL INC"
## [1] "--------------------------"
## [1] "Companies loading heavily on topic 2 are"
## [1] "SPRINT CORP" "TELEPHONE & DATA SYSTEMS INC"
## [3] "SUNPOWER CORP" "FACEBOOK INC"
## [5] "BROCADE COMMUNICATIONS SYS"
## [1] "--------------------------"
Similarly, we can retrieve the text of the top documents for each topic. Since these documents are very large, they are not printed here.
eta.file = function(mat,calib,n) {
s = list() # Blank List
for (i in 1: ncol(mat)) # For each topic
{
read_doc = mat[order(mat[,i], decreasing= T),] # Sort document prop matrix (twc)
read_names = row.names(read_doc[1:n,]) # document indices of the first n documents
s[[i]] = calib[as.numeric(read_names),2] # Store first n documents in list
}
return(s)
}
temp2 = eta.file(twc,textdata,5)
# for (i in 1:length(temp2)){
# print(paste('Documents loading heavily on topic',i,'are'))
# print(temp2[[i]])
# print('--------------------------')
# }
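Instead of printing the full texts, one can peek at how each topic’s top document begins (a sketch):
for (i in 1:length(temp2)){
cat("Topic", i, "top document begins:\n")
cat(substr(temp2[[i]][1], 1, 300), "...\n\n")
}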
Here is a function to repeat the whole study for a given number of topics K.
studyLDAModel = function(K) {
simfit = topics(dtm1, K = K, verb = 2) # Fit the K topic model
summary(simfit, nwrd = 12) # Summary of simfit model
simfit$theta[1:10,]
theta <- simfit$theta[1:10,]
a0 = apply(simfit$theta, 1, sum);
a01 = order(a0, decreasing = TRUE)
simfit$theta[a01[1:10],]
simfit$omega[1:10,]
t = Sys.time()
theta = simfit$theta
lift = theta*0; sum1 = sum(dtm1)
for (i in 1:nrow(theta)){
for (j in 1:ncol(theta)){
ptermtopic = 0; pterm = 0;
ptermtopic = theta[i, j] # term i's probability of topic j membership
pterm = sum(dtm1[,i])/sum1 # marginal probability of term i's occurrence in corpus
lift[i, j] = ptermtopic/pterm # so, lift is topic membership probability normalized by occurrence probability
}
}
Sys.time() - t # Total time for calculating lift
lift[1:10,]
for (i in 1:K){ # For each topic
a0 = which(lift[,i] > 1) # terms with lift greater than 1 for topic i
freq = theta[a0,i] # theta for those terms
freq = sort(freq,decreasing = T) # terms with higher probabilities first for topic i
# Auto correction - sometimes fewer than 100 terms in a topic have lift above 1
n = ifelse(length(freq) >= 100, 100, length(freq))
top_word = as.matrix(freq[1:n])
# Plot wordcloud
wordcloud(rownames(top_word), top_word, scale=c(4,0.5), 1,
random.order=FALSE, random.color=FALSE,
colors=brewer.pal(8, "Dark2"))
mtext(paste("Latent Topic",i), side = 3, line = 2, cex=2)
}
for (i in 1:K){ # For each topic
a0 = which(lift[,i] > 1) # terms with lift greater than 1 for topic i
freq = theta[a0,i] # theta for those terms
freq = sort(freq,decreasing = T) # terms with higher probabilities first for topic i
# Auto correction - sometimes fewer than 20 terms in a topic have lift above 1
n = ifelse(length(freq) >= 20, 20, length(freq))
top_word = as.matrix(freq[1:n])
# now restrict the Document Term Matrix to these top terms
mat = dtm1[,match(row.names(top_word),colnames(dtm1))]
mat = as.matrix(mat)
cmat = t(mat) %*% (mat)
diag(cmat) = 0
# Let's limit number of connections to 2
for (p in 1:nrow(cmat)){
vec = cmat[p,]
cutoff = sort(vec, decreasing = T)[2]
cmat[p,][cmat[p,] < cutoff] = 0
}
#cmat[cmat < quantile(cmat,.80)] = 0
graph <- graph.adjacency(cmat, mode = "undirected",weighted=T)
plot(graph, #the graph to be plotted
layout=layout.fruchterman.reingold, # the layout method.
vertex.frame.color='blue', #the color of the border of the dots
vertex.label.color='black', #the color of the name labels
vertex.label.font=1, #the font of the name labels
vertex.size = .00001, # Dots size
vertex.label.cex=1.3)
mtext(paste("Topic",i), side = 3, line = 2, cex=2)
}
twc = eta(lift, dtm1)
head(twc)
temp1 = eta.file.name(twc,textdata,5)
for (i in 1:length(temp1)){
print(paste('Companies loading heavily on topic',i,'are'))
print(temp1[[i]])
print('--------------------------')
}
}
K=3
studyLDAModel(K)
##
## Estimating on a 85 document collection.
## Fitting the 3 topic model.
## log posterior increase: 890.2, 1133.3, done. (L = -2335640.4)
##
## Top 12 phrases by topic-over-null term lift (and usage %):
##
## [1] 'abandon', 'atic', 'dominance', 'fastest', 'instruction', 'pack', 'sunnyvale', 'amk', 'applied-products', 'linewidths', 'varian', 'manufacture-product' (75.3)
## [2] 'access-facilities', 'blurred', 'control-management', 'dated', 'deploys', 'mvno', 'revenues-future', 'subsidized', 'waiting', 'dividing', 'holdco', 'sprint-clearwire' (13.1)
## [3] 'multitude', 'withheld', 'broadridge', 'censures', 'solutions-business', 'intentionally', 'clients-customers', 'clearpath', 'adequate-security', 'acquirors', 'admissions', 'cerner' (11.6)
##
## Dispersion = 2
## [1] "Companies loading heavily on topic 1 are"
## [1] "SUNPOWER CORP" "ADVANCED MICRO DEVICES"
## [3] "APPLIED MATERIALS INC" "SANDISK CORP"
## [5] "LEAP WIRELESS INTL INC"
## [1] "--------------------------"
## [1] "Companies loading heavily on topic 2 are"
## [1] "SPRINT CORP" "TELEPHONE & DATA SYSTEMS INC"
## [3] "SUNPOWER CORP" "BROCADE COMMUNICATIONS SYS"
## [5] "ECHOSTAR CORP"
## [1] "--------------------------"
## [1] "Companies loading heavily on topic 3 are"
## [1] "IAC/INTERACTIVECORP" "GROUPON INC"
## [3] "CERNER CORP" "AUTOMATIC DATA PROCESSING"
## [5] "BROADRIDGE FINANCIAL SOLUTNS"
## [1] "--------------------------"
K=4
studyLDAModel(K)
##
## Estimating on a 85 document collection.
## Fitting the 4 topic model.
## log posterior increase: 3545.4, 1671.9, 1348.9, 693.1, 463.3, done. (L = -2314418.5)
##
## Top 12 phrases by topic-over-null term lift (and usage %):
##
## [1] 'abandon', 'atic', 'dominance', 'fastest', 'instruction', 'pack', 'sunnyvale', 'amk', 'applied-products', 'linewidths', 'varian', 'dense' (59.6)
## [2] 'consequent', 'mdash', 'repealed', 'affect-reported', 'company-customers', 'facilities-equipment', 'hai', 'hon', 'kill', 'lived', 'pricings', 'quick' (16.4)
## [3] 'access-facilities', 'blurred', 'control-management', 'dated', 'deploys', 'mvno', 'revenues-future', 'subsidized', 'waiting', 'rollout', 'household', 'sign' (14)
## [4] 'withheld', 'broadridge', 'censures', 'solutions-business', 'clearpath', 'intentionally', 'adequate-security', 'clients-customers', 'admissions', 'cerner', 'coding', 'diagnosis' (10)
##
## Dispersion = 1.88
## [1] "Companies loading heavily on topic 1 are"
## [1] "SUNPOWER CORP" "ADVANCED MICRO DEVICES"
## [3] "APPLIED MATERIALS INC" "SANDISK CORP"
## [5] "CENTURYLINK INC"
## [1] "--------------------------"
## [1] "Companies loading heavily on topic 2 are"
## [1] "CISCO SYSTEMS INC" "JABIL CIRCUIT INC"
## [3] "SANDISK CORP" "AMKOR TECHNOLOGY INC"
## [5] "TELEDYNE TECHNOLOGIES INC"
## [1] "--------------------------"
## [1] "Companies loading heavily on topic 3 are"
## [1] "SPRINT CORP" "TELEPHONE & DATA SYSTEMS INC"
## [3] "LEAP WIRELESS INTL INC" "SUNPOWER CORP"
## [5] "FACEBOOK INC"
## [1] "--------------------------"
## [1] "Companies loading heavily on topic 4 are"
## [1] "MANTECH INTL CORP" "CERNER CORP"
## [3] "BROADRIDGE FINANCIAL SOLUTNS" "UNISYS CORP"
## [5] "COMPUTER SCIENCES CORP"
## [1] "--------------------------"
K=5
studyLDAModel(K)
##
## Estimating on a 85 document collection.
## Fitting the 5 topic model.
## log posterior increase: 1271.3, 2374.8, 1480.8, 855.8, 702.2, 514.6, 470.4, done. (L = -2305161.6)
##
## Top 12 phrases by topic-over-null term lift (and usage %):
##
## [1] 'amk', 'applied-products', 'linewidths', 'varian', 'affect-applied', 'general-industry', 'inflections', 'tvs', 'abandon', 'atic', 'dominance', 'fastest' (46)
## [2] 'access-facilities', 'blurred', 'control-management', 'dated', 'deploys', 'mvno', 'revenues-future', 'subsidized', 'waiting', 'rollout', 'acs', 'baseline' (21.8)
## [3] 'manufacture-product', 'axis', 'guarantor', 'rejection', 'removable', 'spills', 'wdc', 'yokkaichi', 'unsold', 'expenses-including', 'controllers', 'lsi' (14.3)
## [4] 'broadridge', 'censures', 'solutions-business', 'intentionally', 'clearpath', 'adequate-security', 'clients-customers', 'admissions', 'cerner', 'coding', 'diagnosis', 'displaying' (12.3)
## [5] 'reputation-harm', 'depreciating', 'disadvantageous', 'equity-debt', 'federal-tax', 'foot', 'payouts', 'stockholders-including', 'tanks', 'trs', 'cabinet', 'prevailing-market' (5.6)
##
## Dispersion = 1.82
## [1] "Companies loading heavily on topic 1 are"
## [1] "APPLIED MATERIALS INC" "ADVANCED MICRO DEVICES"
## [3] "SUNPOWER CORP" "CISCO SYSTEMS INC"
## [5] "CENTURYLINK INC"
## [1] "--------------------------"
## [1] "Companies loading heavily on topic 2 are"
## [1] "SPRINT CORP" "TELEPHONE & DATA SYSTEMS INC"
## [3] "LEAP WIRELESS INTL INC" "FACEBOOK INC"
## [5] "SUNPOWER CORP"
## [1] "--------------------------"
## [1] "Companies loading heavily on topic 3 are"
## [1] "SANDISK CORP" "AMKOR TECHNOLOGY INC" "WESTERN DIGITAL CORP"
## [4] "JABIL CIRCUIT INC" "SANMINA CORP"
## [1] "--------------------------"
## [1] "Companies loading heavily on topic 4 are"
## [1] "MANTECH INTL CORP" "CERNER CORP"
## [3] "UNISYS CORP" "BROADRIDGE FINANCIAL SOLUTNS"
## [5] "COMPUTER SCIENCES CORP"
## [1] "--------------------------"
## [1] "Companies loading heavily on topic 5 are"
## [1] "ECHOSTAR CORP" "EQUINIX INC"
## [3] "IRON MOUNTAIN INC" "TRIMBLE NAVIGATION LTD"
## [5] "ITRON INC"
## [1] "--------------------------"