In this article we look at how to fit a latent topic model and interpret its topics. We demonstrate topic analysis on the SEC 10-K filings of 85 randomly chosen firms in the Technology sector of the Fortune 1000.
A Form 10-K is an annual report required by the U.S. Securities and Exchange Commission (SEC) that gives a comprehensive summary of a company’s financial performance. Every annual report contains 4 parts and 15 schedules.
We are primarily interested in Item 1A (Risk Factors). In this section, firms disclose their exposure to the risks they anticipate in the coming year. The company lays out anything that could go wrong: likely external effects, possible future failures to meet obligations, and other risks, disclosed to adequately warn investors and potential investors. These risks can also have a major impact on business growth.
We topic-analyse the Item 1A Risk Factors text of these 85 Tech firms using the well-known LDA model, to identify the risk factors and explore the world of Known Unknowns.
We chose TF-IDF weighting for the topic analysis because it extracts the most descriptive terms in a document. The word clouds below show the relevance of terms under both TF and TF-IDF.
We can clearly see that TF-IDF gives more interpretable results than TF.
We tried 2, 3, 4 and 5 topics and selected K = 5, because it produced the most interpretable results and let us tell a clear story from the topics. However, only 4 of the 5 topics supported a clear inference.
Latent Topic 1: Semiconductor manufacturing - inventory, raw materials and suppliers: This topic shows the dependence of the semiconductor manufacturing industry on inventory, raw materials and suppliers. These factors have a direct impact on net sales, profit and company growth. To minimize the impact, companies have to plan their actions accordingly. Manufacturers in this domain, such as Intel, Analog Devices and Texas Instruments, face similar risks.
Latent Topic 2: Miscellaneous: This topic threw up a lot of different words. One of the key words was conversion: Iron Mountain’s customers may shift from paper storage to alternative technologies that require less physical space.
Latent Topic 3: Sources of Revenue - User base, advertising, searches: This topic relates to companies in the search engine, social media and cloud space. Most of these companies earn their revenue from advertising and depend on a large user base. AOL is one example, with its focus on an online advertising-supported business model. This involves significant risk: as subscription revenues continue to decline, the company has become increasingly dependent on advertising revenues.
Latent Topic 4: Health care reforms and other government regulations:
We found a small set of companies impacted by Health care reforms. It’s difficult to determine their impact on the business.
Latent Topic 5: Role of FCC in spectrum regulations:
This topic covers the risk factors of telecommunications companies. The FCC has a major influence on these companies’ operations, and they need to follow the norms it sets. We can also see that these companies depend on the loyalty of their subscribers, as shown by words like “subscribers”, “lose customers” and “business customers”.
1. A company that depends on raw materials needs to assess this risk and address it, either by lining up other reliable vendors or by becoming self-reliant in producing the raw material.
2. Companies can compare their risk factors with those of their competitors and see which risks to mitigate in order to make the business more resilient than its competitors.
3. Organizations can identify which companies depend on their business and price their products and services competitively.
4. A company should study the strategies and actions of other companies facing similar risks and issues, and incorporate them after detailed analysis.
5. At industry meetups, companies can work out common solutions to common risks so that the impact on growth is minimized.
So, let’s enter the world of Topic Analysis.
Below is the list of outputs produced to analyze the topics:
- TF Wordcloud
- TFIDF Wordcloud
- 5 Topic Wordclouds
- 5 Topic Co-occurrence
- Lift Matrix
- Omega Matrix
- Theta Matrix
- Frequent words Plot
The first step in topic mining is to process the text data and create a Document Term Matrix. We process the text data and provide the Document Term Matrix and the text data as R objects (.Rds), so that they can be loaded directly into the R workspace. For fitting and visualising a Latent Topic Model, we need the following packages.
Let’s clear the workspace and load the required libraries.
rm(list=ls()) # Clear the workspace
options(warn = -1)
library("tm")
library("wordcloud")
library("maptpx")
library("igraph")
library("rJava")
library("textir")
library("RWeka")
library("qdap")
library("cluster")
library("fpc")
library("ggplot2")
textdata = readRDS(file.choose()) # Select BD.Technology.Rds OR RF.Technology.Rds data set
dtm0 = readRDS(file.choose()) # Select dtm1.BD.Rds or dtm1.RF.Rds
Let’s remove the sparse terms from the above DTM.
dtm1 <- removeSparseTerms(x = dtm0, .95) # Remove sparse terms
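To make the sparsity threshold concrete, here is a minimal base-R sketch on a made-up count matrix. The matrix `toy`, its terms and the 0.7 threshold are illustrative and not taken from the filings. The sketch assumes, as I read the tm documentation, that removeSparseTerms keeps only the terms whose share of empty documents is below the threshold.

```r
# Toy document-term matrix: 5 documents (rows) x 4 terms (columns); values are made up.
toy <- matrix(c(2, 0, 1, 0,
                1, 0, 0, 0,
                3, 1, 0, 0,
                1, 0, 2, 0,
                2, 0, 0, 1),
              nrow = 5, byrow = TRUE,
              dimnames = list(paste0("doc", 1:5),
                              c("business", "wafer", "spectrum", "churn")))

# Sparsity of a term = share of documents in which it never appears.
sparsity <- colMeans(toy == 0)

# Keep the terms whose sparsity is below the (illustrative) threshold.
threshold <- 0.7
toy_dense <- toy[, sparsity < threshold, drop = FALSE]
colnames(toy_dense)
```

Under that reading, a threshold of .95 on 85 documents keeps the terms that appear in more than 85 x 0.05 = 4.25 documents, i.e. in at least 5 filings.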
Let’s now remove some very common words from the DTM; they occur almost everywhere and so are unlikely to be of much relevance.
delwords <- c("applied","client","clients","company","table","amp","content","contents","companys")
dtm1 <- dtm1[, !(colnames(dtm1) %in% delwords)] # Drop these terms where present; safe even if none of them match
Now let’s convert the DTM from TF to TF-IDF weights.
dtm2inv = weightTfIdf(dtm1, normalize = TRUE) # Use this code to convert tf into tf-idf
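As a rough illustration of what this weighting does, the sketch below computes TF-IDF by hand on a tiny made-up count matrix. It assumes the scheme described in the tm documentation for `normalize = TRUE`: term counts divided by document length, multiplied by log2 of (number of documents / number of documents containing the term). The matrix `counts` and its values are hypothetical.

```r
# Toy counts: 3 documents x 3 terms (illustrative values only).
counts <- matrix(c(4, 1, 0,
                   2, 0, 3,
                   6, 0, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("doc", 1:3),
                                 c("business", "wafer", "spectrum")))

tf    <- counts / rowSums(counts)                  # term share within each document
idf   <- log2(nrow(counts) / colSums(counts > 0))  # rarer terms get a larger weight
tfidf <- sweep(tf, 2, idf, "*")                    # scale each term column by its idf

round(tfidf, 3)
```

A term that occurs in every document, like "business" here, has an idf of log2(1) = 0 and so receives zero weight everywhere. That is the sense in which TF-IDF pushes generic words out of the picture and favours terms that distinguish documents from each other.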
# 1- Using term frequency (tf)
freq1 = (sort(apply(dtm1,2,sum), decreasing =T)) # Calculate term frequency
freq1[1:50] # View top 50 terms
## products business operations
## 3360 3048 2404
## customers result services
## 2192 2139 1784
## ability addition results
## 1770 1517 1447
## subject future financial-condition
## 1416 1314 1289
## adversely-affect operating-results including
## 1244 1082 1064
## time continue increase
## 988 952 946
## costs sales intellectual-property
## 832 832 811
## unable required competitors
## 766 737 706
## revenue adverse-effect material
## 704 696 681
## failure regulations risks
## 664 663 663
## laws technology significant
## 660 659 655
## market adversely-affected harm
## 640 630 628
## systems parties common-stock
## 614 604 601
## factors technologies number
## 587 586 584
## companies loss revenues
## 577 571 567
## provide rights impact
## 559 558 550
## demand product
## 547 546
#windows() # New plot window
wordcloud(names(freq1), freq1, scale=c(4,0.5),1, max.words=100,colors=brewer.pal(8, "Dark2")) # Plot results in a word cloud
title(sub = "Term Frequency - Wordcloud")
# 2- Using term frequency inverse document frequency (tf-idf)
freq2 = (sort(apply(dtm2inv,2,sum), decreasing =T)) # Calculate total tf-idf weight per term
freq2[1:50] # View top 50 terms
## fcc health-care class
## 0.11556342 0.10469539 0.10176548
## advertising users search
## 0.09953686 0.09329478 0.08728723
## wireless net-sales hardware-systems
## 0.08490011 0.07920468 0.07817538
## devices subscribers software
## 0.07743916 0.07349681 0.07299500
## online stockholders open-source
## 0.07041167 0.07033241 0.06859361
## network spectrum carriers
## 0.06848363 0.06773348 0.06737578
## million websites internet
## 0.06728889 0.06682429 0.06657192
## notes components satellite
## 0.06607728 0.06600597 0.06465871
## manufacturing pension google
## 0.06462452 0.06414784 0.06113031
## credit-facility inventory federal-government
## 0.06110236 0.06034754 0.06029047
## consumers semiconductor-industry indebtedness
## 0.05911113 0.05870343 0.05841334
## distributors conversion fiscal
## 0.05756813 0.05740358 0.05677466
## data-center contract production
## 0.05676620 0.05666777 0.05620974
## subcontractors licensees display
## 0.05573979 0.05560776 0.05537832
## software-products materials business-results
## 0.05501652 0.05492513 0.05457213
## solutions networks directors
## 0.05377086 0.05313552 0.05305243
## subscription contracts
## 0.05246345 0.05220857
#windows() # New plot window
wordcloud(names(freq2), freq2, scale=c(4,0.5),1, max.words=100,colors=brewer.pal(8, "Dark2")) # Plot results in a word cloud
title(sub = "Term Frequency Inverse Document Frequency - Wordcloud")
# See which are the most frequent words
freq <- sort(colSums(as.matrix(dtm2inv)), decreasing=TRUE)
#head(freq, 14)
wf <- data.frame(word=names(freq), freq=freq)
head(wf)
## word freq
## fcc fcc 0.11556342
## health-care health-care 0.10469539
## class class 0.10176548
## advertising advertising 0.09953686
## users users 0.09329478
## search search 0.08728723
# Plot the most frequent words
#windows()
p <- ggplot(subset(wf, freq>.05), aes(word, freq))
p <- p + geom_bar(stat="identity")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
p
In LDA, we (i.e., the user) have to supply the number of latent topics we think the corpus contains. Initially we did not know how many topics were really present, so we analysed the data with several values of K (2, 3, 4 and 5). K = 5 gave the cleanest, most reasonable and most interpretable topics, so we went ahead and fit a 5-topic model using the TF-IDF weights.
K = 5 # Choose number of topics in the model
#simfit = topics(dtm1, K = K, verb = 2) # Fit the K topic model
simfit = topics(dtm2inv, K = K, verb = 2) # Fit the K topic model
##
## Estimating on a 85 document collection.
## Fitting the 5 topic model.
## log posterior increase: 1.2, 0.6, 0.4, 0.2, 0.1, done. (L = -946.9)
summary(simfit, nwrd = 12) # Summary of simfit model
##
## Top 12 phrases by topic-over-null term lift (and usage %):
##
## [1] 'wafers', 'wafer', 'foundries', 'semiconductor-products', 'semiconductor-industry', 'wafer-fabrication', 'process-technologies', 'semiconductor', 'manufacturing-processes', 'fabrication', 'manufacturing-process', 'manufacturing-services' (20.7)
## [2] 'brazilian', 'information-management', 'contractual-requirements', 'manufacturing-partners', 'convertible-notes', 'records', 'sales-channels', 'real-estate', 'customer-contracts', 'accumulated', 'irs', 'storage' (20.1)
## [3] 'microsoft', 'windows', 'cloud-based', 'subscription', 'software-products', 'consumers', 'hardware-systems', 'consoles', 'online', 'subscriptions', 'platforms', 'cloud' (20)
## [4] 'health-care', 'security-systems', 'highest', 'payroll', 'market-risk', 'health-insurance', 'federal-government', 'clearing', 'statutes', 'financial-loss', 'risks-facing', 'account-information' (19.8)
## [5] 'subscribers', 'subscriber', 'spectrum', 'wireless-networks', 'consummated', 'verizon', 'lte', 'mhz', 'debt-financing', 'fcc', 'wimax', 'churn' (19.4)
##
## Dispersion = -0.02
Suppose there are D documents and T distinct terms in the corpus, and we are fitting K topics. Then the topic model gives us two main outputs:
One, a \(\theta\) matrix of term probabilities: each column is one topic’s probability distribution over the terms, so a large entry means the term is characteristic of that topic. Its dimension is T x K.
Two, a \(\omega\) document-composition matrix: each row is the probability mass distribution of topic proportions in one document. Its dimension is D x K.
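The shapes of the two matrices, and how they combine, can be sketched on simulated data. Everything below is illustrative: the sizes and the random values are made up, and the names `theta_toy` and `omega_toy` are hypothetical stand-ins for the fitted `simfit$theta` and `simfit$omega`.

```r
set.seed(1)
n_terms <- 6; n_docs <- 4; n_topics <- 2   # toy sizes, chosen only for illustration

# theta: one column per topic, each column a probability distribution over terms.
theta_toy <- matrix(runif(n_terms * n_topics), n_terms, n_topics)
theta_toy <- sweep(theta_toy, 2, colSums(theta_toy), "/")

# omega: one row per document, each row a distribution over topics.
omega_toy <- matrix(runif(n_docs * n_topics), n_docs, n_topics)
omega_toy <- omega_toy / rowSums(omega_toy)

# Implied term distribution of each document: a topic-weighted mix of theta's columns.
doc_term_prob <- omega_toy %*% t(theta_toy)

dim(theta_toy)          # terms x topics
dim(omega_toy)          # documents x topics
rowSums(doc_term_prob)  # each document's term probabilities sum to 1
```

Reading the model this way makes the later steps easier to follow: \(\theta\) says which terms define a topic, and \(\omega\) says how much of each topic a document contains.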
Let’s sort the rows of \(\theta\) in decreasing order of total term probability and look at the top few terms:
a0 = apply(simfit$theta, 1, sum);
a01 = order(a0, decreasing = TRUE)
simfit$theta[a01[1:10],]
## topic
## phrase 1 2 3 4
## fcc 2.743454e-06 3.339662e-06 2.983672e-06 3.102463e-06
## health-care 3.756629e-06 4.644913e-06 3.772714e-06 5.612564e-03
## class 5.098074e-06 5.140999e-06 5.249932e-03 5.318606e-06
## advertising 2.393756e-06 2.651254e-06 5.121660e-03 2.916539e-06
## users 3.199533e-06 3.787551e-06 4.799522e-03 3.397154e-06
## wireless 2.665658e-06 2.850212e-06 3.072653e-06 4.089448e-06
## search 2.252650e-06 2.457995e-06 4.472769e-03 2.626506e-06
## devices 7.563963e-05 3.034744e-05 2.898831e-03 1.084804e-04
## hardware-systems 3.406118e-06 5.100126e-06 4.046231e-03 4.084160e-06
## subscribers 2.655436e-06 3.152296e-06 1.271372e-05 3.312795e-06
## topic
## phrase 5
## fcc 6.277246e-03
## health-care 4.298093e-06
## class 7.229513e-06
## advertising 3.948578e-06
## users 3.783304e-06
## wireless 4.593332e-03
## search 2.740956e-06
## devices 9.675240e-04
## hardware-systems 3.705085e-06
## subscribers 4.025614e-03
Here we can see that advertising, devices, hardware-systems and search have a higher probability under topic 3. Similarly, we can look at the \(\omega\) matrix for documents.
Let’s look at a few rows of \(\omega\). Each row holds one document’s topic proportions and sums to 1, so sorting by row sum, as the code below does, is essentially arbitrary and simply picks ten documents to display:
a0 = apply(simfit$omega, 1, sum);
a01 = order(a0, decreasing = TRUE)
simfit$omega[a01[1:10],]
## topic
## document 1 2 3 4 5
## 3 0.38413309 0.1541613 0.1616716 0.1530563 0.1469777
## 13 0.18244784 0.2534216 0.1734561 0.2082247 0.1824497
## 42 0.12622406 0.1148924 0.1058127 0.5226805 0.1303904
## 68 0.13575288 0.1530699 0.1544839 0.1558510 0.4008423
## 81 0.16877185 0.2042642 0.3093618 0.1686324 0.1489698
## 83 0.15262526 0.2011450 0.1645768 0.1501069 0.3315460
## 2 0.36905331 0.1491659 0.1499737 0.1509191 0.1808880
## 4 0.09968835 0.1000903 0.1026223 0.5928737 0.1047255
## 5 0.13326192 0.1663668 0.1533591 0.1641318 0.3828804
## 6 0.10513169 0.1093335 0.1193681 0.1258570 0.5403097
Document 3 loads most heavily on topic 1 (0.38), documents 42 and 4 on topic 4 (0.52 and 0.59) and document 6 on topic 5 (0.54), whereas document 13 is spread fairly evenly across the five topics.
Some terms have high frequency, others have low frequency. We want to ensure that term frequency does not unduly influence topic weights. So we normalize term frequency in a metric called ‘lift’.
The lift of a term is its topic membership probability normalized by the occurrence probability of the term. If the lift of a term for a topic is high, the term is useful in constructing that topic.
Since the topics function doesn’t return a lift matrix for terms, we can write a simple function to calculate the lift of each term.
Depending on the number of terms in the DocumentTermMatrix, the lift calculation may take some time. It should complete in 2-3 minutes.
t = Sys.time()
theta = simfit$theta
lift = theta*0; sum1 = sum(dtm2inv)
for (i in 1:nrow(theta)){
for (j in 1:ncol(theta)){
ptermtopic = 0; pterm = 0;
ptermtopic = theta[i, j] # term i's probability of topic j membership
pterm = sum(dtm2inv[,i])/sum1 # marginal probability of term i's occurrence in corpus
if (is.finite(ptermtopic/pterm)) lift[i, j] = ptermtopic/pterm else lift[i, j] = 0
# so, lift is topic membership probability normalized by occurrence probability
}
}
Sys.time() - t # Total time for calculating lift
## Time difference of 1.611603 mins
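The double loop above can also be written in vectorized form. The sketch below is one way to do it, shown on small made-up inputs; `theta_toy` and `weights_toy` are hypothetical stand-ins for `simfit$theta` and the TF-IDF matrix, and the rows of `theta_toy` are assumed to be in the same order as the columns of `weights_toy`, which the loop above assumes as well.

```r
# Toy inputs (illustrative): 4 terms x 2 topics, and a 3-document x 4-term weight matrix.
theta_toy <- matrix(c(0.50, 0.10,
                      0.30, 0.20,
                      0.15, 0.30,
                      0.05, 0.40),
                    ncol = 2, byrow = TRUE,
                    dimnames = list(c("wafer", "foundry", "fcc", "ability"), NULL))
weights_toy <- matrix(c(0.4, 0.2, 0.0, 0,
                        0.1, 0.3, 0.2, 0,
                        0.0, 0.0, 0.6, 0),
                      nrow = 3, byrow = TRUE)

# Marginal weight share of each term in the corpus.
pterm <- colSums(weights_toy) / sum(weights_toy)

# Divide every row of theta by that term's marginal share;
# pterm has one value per term and recycles down the rows.
lift_toy <- theta_toy / pterm
lift_toy[!is.finite(lift_toy)] <- 0   # terms with zero total weight get lift 0, as in the loop
lift_toy
```

The last toy term has zero total weight, so its lift is set to 0 for every topic, which is the same pattern as a row of zeros in the lift matrix. Computing the marginal share once per term, instead of once per term and topic, is what saves the time.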
Let’s print the lift for the first 10 terms. [Remember, the lift matrix has the same dimensions as the \(\theta\) matrix.]
lift[1:10,]
## topic
## phrase 1 2 3 4
## abilities 0.23053372 4.09557994 0.34976772 0.26842036
## ability 0.00000000 0.00000000 0.00000000 0.00000000
## abroad 0.28891695 0.09371211 1.42461880 2.83579750
## absence 0.40795595 0.09203087 3.80556679 0.08819723
## absolute 0.17611796 0.20506073 3.87752985 0.11490515
## absolute-assurance 1.25305125 3.48477076 0.07069907 0.10644254
## absorb 0.23876220 4.36888529 0.10287237 0.12652026
## accelerate 0.09342606 0.15721273 0.19205275 1.87002620
## accelerated 0.20597130 0.19567025 0.04476654 1.32107880
## acceleration 0.12320942 0.30080173 0.04666774 0.05741015
## topic
## phrase 5
## abilities 0.07177219
## ability 0.00000000
## abroad 0.39614532
## absence 0.63898179
## absolute 0.73513678
## absolute-assurance 0.07441881
## absorb 0.24741516
## accelerate 2.84132945
## accelerated 3.38866488
## acceleration 4.67356471
Now we have lift and theta for each term and each topic. We can plot a wordcloud for each topic, in which a term is included if its lift is above 1 and its size is proportional to its term probability. This wordcloud gives us an idea of the latent topic. Let’s plot the top 100 terms in each topic; by interpreting those terms we can find the theme that binds them together.
for (i in 1:K){ # For each topic
a0 = which(lift[,i] > 1) # Terms with lift greater than 1 for topic i
freq = theta[a0,i] # Theta for those terms
freq = sort(freq,decreasing = T) # Terms with higher probabilities for topic i come first
# Sometimes fewer than 100 terms have lift above 1 for a topic, so cap n at the number available
n = min(100, length(freq))
top_word = as.matrix(freq[1:n])
# Plot wordcloud
wordcloud(rownames(top_word), top_word, scale=c(4,0.5), 1,
random.order=FALSE, random.color=FALSE,
colors=brewer.pal(8, "Dark2"), use.r.layout = FALSE)
mtext(paste("Latent Topic",i), side = 3, line = 2, cex=2)
}
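The selection rule used in the loop above (keep the terms with lift above 1, then rank them by term probability) can be seen in isolation on a handful of made-up values. The vectors `theta_k` and `lift_k` are hypothetical and stand for one column of theta and of lift.

```r
# Toy theta and lift values for a single topic (illustrative numbers).
theta_k <- c(wafer = 0.30, foundry = 0.25, business = 0.20, fcc = 0.15, churn = 0.10)
lift_k  <- c(wafer = 3.1,  foundry = 2.4,  business = 0.6,  fcc = 0.2,  churn = 1.3)

n_max <- 2                                            # cap on the number of terms to draw
keep  <- theta_k[lift_k > 1]                          # terms that are distinctive for this topic
top   <- head(sort(keep, decreasing = TRUE), n_max)   # head() copes with short vectors
top
```

Here "business" has a high probability but a lift below 1, so it is dropped: it is frequent in the topic only because it is frequent everywhere.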
From these wordclouds we can label and interpret the topics. To get a clearer picture, let’s plot a co-occurrence graph of the top 20 terms in each topic. As with the topic wordclouds, we first find the top terms for a topic and then construct a co-occurrence matrix. Once the co-occurrence matrix is constructed, we can plot it using the graph.adjacency function from the igraph package. Note that, for better readability, we censor this matrix to keep only each term’s strongest edges. We also use an undirected graph rather than a directed one, because a directed graph is harder to interpret.
for (i in 1:K){ # For each topic
a0 = which(lift[,i] > 1) # Terms with lift greater than 1 for topic i
freq = theta[a0,i] # Theta for those terms
freq = sort(freq,decreasing = T) # Terms with higher probabilities for topic i come first
# Sometimes fewer than 20 terms have lift above 1 for a topic, so cap n at the number available
n = min(20, length(freq))
top_word = as.matrix(freq[1:n])
# Now pick the Document Term Matrix columns for these top words
mat = dtm2inv[,match(row.names(top_word),colnames(dtm2inv))]
mat = as.matrix(mat)
cmat = t(mat) %*% (mat)
diag(cmat) = 0
# Keep only each term's strongest links: zero the entries below the row's 3rd-largest value
for (p in 1:nrow(cmat)){
vec = cmat[p,]
cutoff = sort(vec, decreasing = T)[3]
cmat[p,][cmat[p,] < cutoff] = 0
}
cmat[cmat < quantile(cmat,.80)] = 0
graph <- graph.adjacency(cmat, mode = "undirected",weighted=T)
plot(graph, #the graph to be plotted
layout=layout.fruchterman.reingold, # the layout method.
vertex.frame.color='blue', #the color of the border of the dots
vertex.label.color='black', #the color of the name labels
vertex.label.font=1, #the font of the name labels
vertex.size = .00001, # Dots size
vertex.label.cex=1.3)
mtext(paste("Topic",i), side = 3, line = 2, cex=2)
}
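The co-occurrence and pruning steps in the loop above can be followed on a small made-up matrix without any plotting. In the sketch below, `mat_toy` is a hypothetical stand-in for the top-term columns of the TF-IDF matrix. The cutoff here is the 2nd-largest value in each row, so at most two links per term survive (barring ties); the code above uses the 3rd-largest value instead.

```r
# Toy weights: 4 documents x 4 terms (illustrative values).
mat_toy <- matrix(c(2, 1, 0, 1,
                    1, 0, 2, 0,
                    0, 2, 1, 0,
                    1, 1, 0, 2),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(NULL, c("wafer", "foundry", "fcc", "churn")))

cmat_toy <- t(mat_toy) %*% mat_toy   # term-by-term co-occurrence strength
diag(cmat_toy) <- 0                  # ignore a term's link to itself

# Row by row, zero out everything weaker than the row's 2nd-largest entry.
for (p in seq_len(nrow(cmat_toy))) {
  cutoff <- sort(cmat_toy[p, ], decreasing = TRUE)[2]
  cmat_toy[p, cmat_toy[p, ] < cutoff] <- 0
}
cmat_toy
```

Because the pruning works row by row, the result need not be symmetric: in this toy case the link from "fcc" to "wafer" survives in the "fcc" row but is removed from the "wafer" row. How such one-sided entries are combined into a single undirected edge is then left to the graph-building function.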
Now we have the lift matrix as well as the DocumentTermMatrix, so we can create a weighting scheme for each document and each topic that gives the proportion of a topic in a document. Here I first define a function and then call it to calculate the topic proportions in the documents.
eta = function(mat, dtm) {
  mat1 = mat/mean(mat) # Rescale lift so that its overall mean is 1
  terms1 = rownames(mat1)
  eta.mat = matrix(0, 1, ncol(mat1)) # Placeholder first row, dropped below
  for (i in 1:nrow(dtm)){
    a11 = as.data.frame(matrix(dtm[i,])) # Weights of document i, one row per term
    rownames(a11) = colnames(dtm)
    a12 = as.matrix(a11[(a11>0),]) # Keep only the terms present in document i
    rownames(a12) = rownames(a11)[(a11>0)]
    a13 = intersect(terms1, rownames(a12)) # Terms shared with the lift matrix
    a14a = match(a13, terms1) # Positions of matching terms in mat1
    a14b = match(a13, rownames(a12)) # Positions of matching terms in a12
    a15 = mat1[a14a,]*matrix(rep(a12[a14b,],
                                 ncol(mat1)),
                             ncol = ncol(mat1)) # Lift times document weight, per term and topic
    eta.mat = rbind(eta.mat, apply(a15, 2, mean)) # Average over the document's terms
    rm(a11, a12, a13, a14a, a14b, a15)
  }
  eta.mat = eta.mat[2:nrow(eta.mat), ] # Remove the placeholder row of zeroes
  row.names(eta.mat)=row.names(dtm)
  return(eta.mat)
}
twc = eta(lift, dtm2inv)
head(twc)
## 1 2 3 4 5
## 1 0.0010461553 0.0006164068 0.0005503467 0.0004586269 0.0008207239
## 2 0.0020247775 0.0006449805 0.0006241522 0.0006122275 0.0008940452
## 3 0.0018310012 0.0006682696 0.0006961478 0.0006245990 0.0006030765
## 4 0.0004329069 0.0004457177 0.0007396598 0.0138980023 0.0009317454
## 5 0.0003873071 0.0006090110 0.0005175799 0.0005839441 0.0016477411
## 6 0.0003863032 0.0004991276 0.0006890585 0.0008019815 0.0046988701
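For intuition, the quantity that the eta function computes can be written as one matrix product: for each document and topic, the average over the document's terms of (rescaled lift x term weight). The sketch below shows this on made-up inputs. `w_toy` and `lift_toy` are hypothetical, and the sketch assumes every term appears in both matrices in the same order.

```r
# Toy inputs (illustrative): 3 documents x 4 terms of weights, 4 terms x 2 topics of lift.
w_toy <- matrix(c(0.4, 0.2, 0.0, 0.0,
                  0.1, 0.0, 0.3, 0.2,
                  0.0, 0.0, 0.5, 0.1),
                nrow = 3, byrow = TRUE,
                dimnames = list(paste0("doc", 1:3),
                                c("wafer", "foundry", "fcc", "churn")))
lift_toy <- matrix(c(3.0, 0.2,
                     2.5, 0.4,
                     0.1, 2.8,
                     0.3, 2.0),
                   ncol = 2, byrow = TRUE,
                   dimnames = list(colnames(w_toy), c("1", "2")))

lift_scaled <- lift_toy / mean(lift_toy)             # same rescaling as mat/mean(mat)
n_present   <- rowSums(w_toy > 0)                    # number of terms present in each document
twc_toy     <- (w_toy %*% lift_scaled) / n_present   # average weighted lift per topic

twc_toy
max.col(twc_toy)   # index of the strongest topic for each document
```

A document scores high on a topic when its heavily weighted terms are also high-lift terms for that topic, which is why these scores are useful for ranking documents within a topic even though they are not proportions that sum to 1.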
Now that we have the topic proportions in each document, we can find the top documents loading on a topic and read them for a better interpretation of the topics.
Here I first define a function that sorts the twc matrix in decreasing order for each topic and picks the names of the top n documents (n = 10 in the call below). Then I call this function with the required arguments and print the company names for each topic.
eta.file.name = function(mat,calib,n) {
  s = list() # Blank list
  for (i in 1:ncol(mat)) { # For each topic
    read_doc = mat[order(mat[,i], decreasing = T),] # Sort document proportion matrix (twc) by topic i
    read_names = row.names(read_doc[1:n,]) # Document indices of the first n documents
    s[[i]] = calib[as.numeric(read_names),1] # Store the first n company names in the list
  }
  return(s)
}
temp1 = eta.file.name(twc,textdata,10)
for (i in 1:length(temp1)){
print(paste('Companies loading heavily on topic',i,'are'))
print(temp1[[i]])
print('--------------------------')
}
## [1] "Companies loading heavily on topic 1 are"
## [1] "TEXAS INSTRUMENTS INC" "INTEL CORP"
## [3] "ACUITY BRANDS INC" "XILINX INC"
## [5] "BENCHMARK ELECTRONICS INC" "LSI CORP"
## [7] "MAXIM INTEGRATED PRODUCTS" "ANALOG DEVICES"
## [9] "APPLIED MATERIALS INC" "PLEXUS CORP"
## [1] "--------------------------"
## [1] "Companies loading heavily on topic 2 are"
## [1] "IRON MOUNTAIN INC" "LEXMARK INTL INC -CL A"
## [3] "HARRIS CORP" "DELL INC"
## [5] "ACUITY BRANDS INC" "EMC CORP/MA"
## [7] "DIEBOLD INC" "CIENA CORP"
## [9] "ITRON INC" "INTL BUSINESS MACHINES CORP"
## [1] "--------------------------"
## [1] "Companies loading heavily on topic 3 are"
## [1] "AOL INC" "IAC/INTERACTIVECORP"
## [3] "MICROSOFT CORP" "INSIGHT ENTERPRISES INC"
## [5] "ELECTRONIC ARTS INC" "YAHOO INC"
## [7] "GOOGLE INC" "SYSTEMAX INC"
## [9] "INTL GAME TECHNOLOGY" "FACEBOOK INC"
## [1] "--------------------------"
## [1] "Companies loading heavily on topic 4 are"
## [1] "AUTOMATIC DATA PROCESSING" "AMPHENOL CORP"
## [3] "UNISYS CORP" "INTL BUSINESS MACHINES CORP"
## [5] "BROADRIDGE FINANCIAL SOLUTNS" "VERIZON COMMUNICATIONS INC"
## [7] "COMPUTER SCIENCES CORP" "DST SYSTEMS INC"
## [9] "XEROX CORP" "FIDELITY NATIONAL INFO SVCS"
## [1] "--------------------------"
## [1] "Companies loading heavily on topic 5 are"
## [1] "VERIZON COMMUNICATIONS INC" "WINDSTREAM HOLDINGS INC"
## [3] "FRONTIER COMMUNICATIONS CORP" "SPRINT CORP"
## [5] "LEAP WIRELESS INTL INC" "CENTURYLINK INC"
## [7] "AMPHENOL CORP" "TELEPHONE & DATA SYSTEMS INC"
## [9] "ECHOSTAR CORP" "DST SYSTEMS INC"
## [1] "--------------------------"
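The ranking step can also be done directly with the indices returned by order(), without going through row names. The sketch below uses a made-up score matrix and hypothetical company names; `twc_toy` and `companies_toy` are not part of the original data.

```r
# Toy topic-weight matrix (4 documents x 2 topics) and made-up company names.
twc_toy <- matrix(c(0.9, 0.1,
                    0.2, 0.7,
                    0.6, 0.3,
                    0.1, 0.8),
                  ncol = 2, byrow = TRUE)
companies_toy <- c("Firm A", "Firm B", "Firm C", "Firm D")   # hypothetical names

top_n <- 2
top_by_topic <- lapply(seq_len(ncol(twc_toy)), function(k) {
  ranked <- order(twc_toy[, k], decreasing = TRUE)   # document indices, strongest first
  companies_toy[head(ranked, top_n)]
})
top_by_topic
```

Indexing by position in this way does not depend on the row names being document numbers, which the as.numeric(read_names) step in the function above relies on.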
Let us find the keywords associated with some of the key terms. Companies with similar areas of interest talk about similar topics, and these associations helped us interpret their interests and issues. The line of code below can be used to display the associations.
#findAssocs(dtm2inv, c("semiconductor","licensees","raw-material","sales","fcc", "google","advertising","health-care","contract"), corlimit=0.60)
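As I understand it, findAssocs reports the terms whose weights across documents correlate with a given term at or above the chosen limit. The base-R sketch below mimics that idea on a made-up matrix; `w_toy`, the term names and the values are illustrative, and the sketch is not a substitute for the tm function.

```r
# Toy weights: 5 documents x 3 terms (illustrative values).
w_toy <- matrix(c(0.5, 0.4, 0.0,
                  0.3, 0.3, 0.1,
                  0.0, 0.1, 0.6,
                  0.4, 0.5, 0.0,
                  0.1, 0.0, 0.4),
                nrow = 5, byrow = TRUE,
                dimnames = list(NULL, c("fcc", "spectrum", "wafer")))

target   <- "fcc"
corlimit <- 0.60

# Pearson correlation between the target term and every other term, across documents.
r <- cor(w_toy[, target], w_toy[, colnames(w_toy) != target])[1, ]
assoc <- sort(r[r >= corlimit], decreasing = TRUE)
assoc
```

In this toy case only "spectrum" passes the limit, because it rises and falls with "fcc" across the documents, while "wafer" moves in the opposite direction.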
Similarly, we can find the top text documents for each topic. Since these documents are very large, I am not printing them here. You can uncomment the code below to print the documents and read them as required.
eta.file = function(mat,calib,n) {
  s = list() # Blank list
  for (i in 1:ncol(mat)) { # For each topic
    read_doc = mat[order(mat[,i], decreasing = T),] # Sort document proportion matrix (twc) by topic i
    read_names = row.names(read_doc[1:n,]) # Document indices of the first n documents
    s[[i]] = calib[as.numeric(read_names),2] # Store the first n documents in the list
  }
  return(s)
}
temp2 = eta.file(twc,textdata,5)
# for (i in 1:length(temp2)){
# print(paste('Documents loading heavily on topic',i,'are'))
# print(temp2[[i]])
# print('--------------------------')
# }