Latent Topic Analysis

Group Homework Session 4 - Topic Analysis

Group A2029

Chandan Sharma - 71610012
Gaurav Budhiraja - 71610027
Saurabh Gupta - 71610072
Yashu Rastogi - 71610106

In this article we look at how to fit a latent topic model and interpret the topics. We demonstrate topic analysis on the SEC 10-K filings of 85 randomly chosen firms in the Technology sector of the Fortune 1000.

In this homework, we are primarily interested in Item 1A (Risk Factors) of SEC 10-K filings. We used the well-known LDA model to identify these risk factors and explore the world of Known Unknowns.

Item 1A- Risk Factors of Form 10-K

In Item 1A of Form 10-K, firms reveal their exposure to anticipated risks in the coming year. Here, the company lays out anything that could go wrong, likely external effects, possible future failures to meet obligations, and other risks disclosed to adequately warn investors and potential investors.

A Form 10-K is an annual report required by the U.S. Securities and Exchange Commission (SEC), that gives a comprehensive summary of a company’s financial performance. Every annual report contains 4 parts and 15 schedules.

We are primarily interested in Item 1A (Risk Factors). These risks can also have a major impact on business growth.

We intend to topic-analyse Item 1A Risk Factor data for these 85 random Tech firms.

Analysis

Which one should we use, TF or TF-IDF? Let’s see…

We chose TF-IDF for the topic analysis because it extracts the most descriptive terms in a document. The word clouds below show the relevance of the terms under both TF and TF-IDF.

TF is dominated by generic terms such as products, business and operations, whereas TF-IDF surfaces distinctive terms such as fcc, health-care and advertising, so TF-IDF gives more interpretable results.

What value of K produces the best results?

We tried 2, 3, 4 and 5 topics and selected 5.

K = 5 produced the best results: we could build a clear story around the topics. However, only 4 of the 5 topics supported a clear inference.

Latent Topic 1: Manufacturing and Suppliers:

The topic below shows the dependence of the semiconductor manufacturing industry on inventory, raw materials and suppliers. These factors have a direct impact on net sales, profit and company growth. To minimize the impact, companies have to plan their actions accordingly. Manufacturers in this domain, such as Intel, Analog Devices and Texas Instruments, face similar risks.

Latent Topic 2: Miscellaneous: This topic threw up a mix of unrelated words. One of the key words was conversion: Iron Mountain’s customers may shift from paper storage to alternative technologies that require less physical space.

Latent Topic 3: Sources of Revenue - User base, advertising, searches: This topic relates to companies in the search engine, social media and cloud space. Most of these companies earn their revenue from advertising and depend on a large user base. AOL is one example, with its focus on an online advertising-supported business model. This involves significant risk: as subscription revenues continue to decline, such companies become increasingly dependent on advertising revenues.

Latent Topic 4: Health care reforms and other government regulations:

We found a small set of companies affected by health care reforms. It is difficult to determine the impact of these reforms on their business.

Latent Topic 5: Role of FCC in spectrum regulations:

This topic captures the risk factors of telecommunications companies. The FCC has a major influence on these companies’ operations, and telecommunication companies need to follow the norms set by the FCC. We can also see that these companies depend on the loyalty of their subscribers, from words like “Subscribers”, “lose customers”, “business customers”, etc.

Application - Topic Analysis of Risk factors

1. One company’s risk can be another’s advantage. For example, if a firm buys raw materials from another company, that supplier has the potential to harm the business of the firm using those raw materials. The firm using the raw materials needs to assess this risk and address it, either by lining up other reliable vendors or by becoming self-reliant in producing the raw material.

2. Companies can compare their risk factors with those of their competitors and see which risks should be mitigated to make the business more resilient than its competitors.

3. Organizations can identify which companies depend on their business and can price their products and services competitively.

4. A company should study the strategies and actions of other companies facing similar risks and issues, and adopt them after detailed analysis.

5. During meetups, companies can come up with common solutions to common risks so that the impact on growth is minimized.

Below are the technical details

So, let’s begin the entry to world of Topic Analysis.

Below is the list of outputs produced to analyze the topics:
- TF Wordcloud
- TFIDF Wordcloud
- 5 Topic Wordclouds
- 5 Topic Co-occurrence
- Lift Matrix
- Omega Matrix
- Theta Matrix
- Frequent words Plot

The first step in topic mining is to process the text data and create a Document Term Matrix. We process the text data and provide the Document Term Matrix and the text data as R objects (.Rds) so that they can be loaded directly into the R workspace. For fitting and visualising a Latent Topic Model, we need the following packages.

tm - for creating DocumentTermMatrix
maptpx - for fitting Latent Topic Model
wordcloud - for plotting wordclouds
igraph - for plotting terms co-occurrence graph
rJava - Low-Level R to Java Interface
textir - Inverse Regression for Text Analysis
RWeka - R/Weka Interface for Machine learning algorithms
qdap - for parsing tools for preparing transcript data

Let’s clear the workspace and invoke the required libraries

rm(list=ls()) # Clear the workspace
options(warn = -1)

library("tm")
library("wordcloud")
library("maptpx")
library("igraph")
library("rJava")
library("textir")
library("RWeka")
library("qdap")
library("cluster")
library("fpc")   
library("ggplot2")

Now read the text data and the DocumentTermMatrix (DTM) into the R session with the readRDS function. textdata is a data frame with 2 columns: the first column is the company name and the second is the text of Item 1A (Risk Factors) from the 10-K filing. The DTM was built from the second column of this text data.

textdata = readRDS(file.choose())     # Select BD.Technology.Rds OR RF.Technology.Rds data set 
dtm0 = readRDS(file.choose())         # Select dtm1.BD.Rds or dtm1.RF.Rds

Let’s remove the sparse terms from the above DTM

dtm1 <- removeSparseTerms(x = dtm0, .95) # Remove sparse words
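
To make the threshold concrete, here is a minimal base-R sketch on a made-up count matrix (the matrix `toy`, its terms and the 0.6 threshold are hypothetical, chosen only for illustration). As we understand removeSparseTerms, a term is kept only when the share of documents in which it is absent is below the threshold, so .95 keeps terms that appear in more than 5% of the documents.

```r
# Hypothetical toy counts: 4 documents x 3 terms (not the 10-K data)
toy <- matrix(c(2, 0, 0,
                1, 0, 3,
                4, 0, 0,
                1, 1, 2),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("doc", 1:4), c("risk", "wafer", "fcc")))

# Sparsity of a term = share of documents in which it does not occur
sparsity <- colMeans(toy == 0)

# Keep only terms whose sparsity is below the chosen threshold
threshold <- 0.6
kept <- toy[, sparsity < threshold, drop = FALSE]

print(sparsity)
print(colnames(kept))
```

In this toy case wafer is absent from three of four documents, so it is dropped, while risk and fcc survive.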

Let’s now remove some very common words from the DTM; they are too common to be of much relevance.

delwords <- c("applied","client","clients","company","table","amp","content","contents","companys")
keep <- which(!(colnames(dtm1) %in% delwords))  # column positions of the terms to retain
dtm1 <- dtm1[, keep]                            # drop the listed terms; safe even if none are present

Now let’s convert the DTM from TF to TF-IDF weights

dtm2inv = weightTfIdf(dtm1, normalize = TRUE) # Use this code to convert tf into tf-idf
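
The weighting can be sketched by hand on a toy matrix. This is an illustration under assumptions, not the package’s code: we assume term frequency is normalised by document length and that the inverse document frequency uses a base-2 logarithm; the matrix `counts` and its terms are invented.

```r
# Hypothetical toy counts: 3 documents x 3 terms
counts <- matrix(c(3, 1, 0,
                   2, 0, 2,
                   1, 0, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("doc", 1:3), c("ability", "wafer", "fcc")))

tf    <- counts / rowSums(counts)    # term frequency normalised by document length
df    <- colSums(counts > 0)         # number of documents containing each term
idf   <- log2(nrow(counts) / df)     # inverse document frequency (base 2 assumed)
tfidf <- sweep(tf, 2, idf, `*`)      # multiply each term's column by its idf

print(round(tfidf, 3))
```

A term that occurs in every document has an idf of log2(1) = 0, so its TF-IDF weight is zero everywhere, which is how generic words drop out of the TF-IDF ranking.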

Basic Analysis

Let us start doing some basic analysis of the above data and plot the word cloud for some of the frequent terms in DTM

#   1- Using Term frequency(tf)             
freq1 = (sort(apply(dtm1,2,sum), decreasing =T)) # Calculate term frequency
freq1[1:50]                                      # View top 50 terms

##              products              business            operations 
##                  3360                  3048                  2404 
##             customers                result              services 
##                  2192                  2139                  1784 
##               ability              addition               results 
##                  1770                  1517                  1447 
##               subject                future   financial-condition 
##                  1416                  1314                  1289 
##      adversely-affect     operating-results             including 
##                  1244                  1082                  1064 
##                  time              continue              increase 
##                   988                   952                   946 
##                 costs                 sales intellectual-property 
##                   832                   832                   811 
##                unable              required           competitors 
##                   766                   737                   706 
##               revenue        adverse-effect              material 
##                   704                   696                   681 
##               failure           regulations                 risks 
##                   664                   663                   663 
##                  laws            technology           significant 
##                   660                   659                   655 
##                market    adversely-affected                  harm 
##                   640                   630                   628 
##               systems               parties          common-stock 
##                   614                   604                   601 
##               factors          technologies                number 
##                   587                   586                   584 
##             companies                  loss              revenues 
##                   577                   571                   567 
##               provide                rights                impact 
##                   559                   558                   550 
##                demand               product 
##                   547                   546

#windows()  # New plot window
wordcloud(names(freq1), freq1, scale=c(4,0.5),1, max.words=100,colors=brewer.pal(8, "Dark2")) # Plot results in a word cloud 
title(sub = "Term Frequency - Wordcloud")

Now let us repeat the same analysis with TF-IDF weights and plot the word cloud of the highest-weighted terms in the DTM

#   2- Using Term Frequency Inverse Document Frequency (tfidf)             
freq2 = (sort(apply(dtm2inv,2,sum), decreasing =T)) # Calculate total tf-idf weight per term
freq2[1:50]                                     # View top 50 terms

##                    fcc            health-care                  class 
##             0.11556342             0.10469539             0.10176548 
##            advertising                  users                 search 
##             0.09953686             0.09329478             0.08728723 
##               wireless              net-sales       hardware-systems 
##             0.08490011             0.07920468             0.07817538 
##                devices            subscribers               software 
##             0.07743916             0.07349681             0.07299500 
##                 online           stockholders            open-source 
##             0.07041167             0.07033241             0.06859361 
##                network               spectrum               carriers 
##             0.06848363             0.06773348             0.06737578 
##                million               websites               internet 
##             0.06728889             0.06682429             0.06657192 
##                  notes             components              satellite 
##             0.06607728             0.06600597             0.06465871 
##          manufacturing                pension                 google 
##             0.06462452             0.06414784             0.06113031 
##        credit-facility              inventory     federal-government 
##             0.06110236             0.06034754             0.06029047 
##              consumers semiconductor-industry           indebtedness 
##             0.05911113             0.05870343             0.05841334 
##           distributors             conversion                 fiscal 
##             0.05756813             0.05740358             0.05677466 
##            data-center               contract             production 
##             0.05676620             0.05666777             0.05620974 
##         subcontractors              licensees                display 
##             0.05573979             0.05560776             0.05537832 
##      software-products              materials       business-results 
##             0.05501652             0.05492513             0.05457213 
##              solutions               networks              directors 
##             0.05377086             0.05313552             0.05305243 
##           subscription              contracts 
##             0.05246345             0.05220857

#windows()  # New plot window
wordcloud(names(freq2), freq2, scale=c(4,0.5),1, max.words=100,colors=brewer.pal(8, "Dark2")) # Plot results in a word cloud 
title(sub = "Term Frequency Inverse Document Frequency - Wordcloud")

Let us see below which are the most frequent terms and plot them

# See which are the most frequent words
freq <- sort(colSums(as.matrix(dtm2inv)), decreasing=TRUE)   
#head(freq, 14)  

wf <- data.frame(word=names(freq), freq=freq)   
head(wf)

##                    word       freq
## fcc                 fcc 0.11556342
## health-care health-care 0.10469539
## class             class 0.10176548
## advertising advertising 0.09953686
## users             users 0.09329478
## search           search 0.08728723

# Plot the most frequent words
#windows() 
p <- ggplot(subset(wf, freq>.05), aes(word, freq))    
p <- p + geom_bar(stat="identity")   
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))   
p

In LDA, the user has to supply the number of latent topics K believed to be present in the corpus. We did not know in advance how many topics there are, so we fitted the model with several values of K (2, 3, 4 and 5). K = 5 gave the cleanest and most interpretable topics, so we went ahead and fitted a 5-topic model on the TF-IDF matrix.

K = 5  # Choose number of topics in the model
#simfit = topics(dtm1,  K = K, verb = 2) # Fit the K topic model
simfit = topics(dtm2inv,  K = K, verb = 2) # Fit the K topic model

## 
## Estimating on a 85 document collection.
## Fitting the 5 topic model.
## log posterior increase: 1.2, 0.6, 0.4, 0.2, 0.1, done. (L = -946.9)

summary(simfit, nwrd = 12)  # Summary of simfit model

## 
## Top 12 phrases by topic-over-null term lift (and usage %):
## 
## [1] 'wafers', 'wafer', 'foundries', 'semiconductor-products', 'semiconductor-industry', 'wafer-fabrication', 'process-technologies', 'semiconductor', 'manufacturing-processes', 'fabrication', 'manufacturing-process', 'manufacturing-services' (20.7) 
## [2] 'brazilian', 'information-management', 'contractual-requirements', 'manufacturing-partners', 'convertible-notes', 'records', 'sales-channels', 'real-estate', 'customer-contracts', 'accumulated', 'irs', 'storage' (20.1) 
## [3] 'microsoft', 'windows', 'cloud-based', 'subscription', 'software-products', 'consumers', 'hardware-systems', 'consoles', 'online', 'subscriptions', 'platforms', 'cloud' (20) 
## [4] 'health-care', 'security-systems', 'highest', 'payroll', 'market-risk', 'health-insurance', 'federal-government', 'clearing', 'statutes', 'financial-loss', 'risks-facing', 'account-information' (19.8) 
## [5] 'subscribers', 'subscriber', 'spectrum', 'wireless-networks', 'consummated', 'verizon', 'lte', 'mhz', 'debt-financing', 'fcc', 'wimax', 'churn' (19.4) 
## 
## Dispersion = -0.02

Suppose there are D documents and T term-tokens in the corpus. And we are fitting K topics. Then, the Topic model gives us mainly two outputs:

One, a \(\theta\) matrix of term probabilities, which gives, for each topic, the probability of each term under that topic. Its dimension is T x K, and each column sums to 1.

Two, an \(\omega\) document-composition matrix, which gives the probability mass distribution of topic proportions in each document. Its dimension is D x K, and each row sums to 1.
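
A small sketch may help fix the two shapes. It assumes that each column of \(\theta\) is a probability distribution over terms and each row of \(\omega\) a distribution over topics; all numbers and names below are invented for illustration.

```r
# Hypothetical 4-term, 2-topic model (values invented for illustration)
theta_toy <- matrix(c(0.70, 0.05,
                      0.20, 0.05,
                      0.05, 0.30,
                      0.05, 0.60),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(c("wafer", "foundries", "spectrum", "fcc"),
                                    c("topic1", "topic2")))
omega_toy <- matrix(c(0.9, 0.1,
                      0.2, 0.8),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(c("docA", "docB"), c("topic1", "topic2")))

print(colSums(theta_toy))   # each topic's term probabilities sum to 1
print(rowSums(omega_toy))   # each document's topic proportions sum to 1

# Expected term distribution of each document: mix the topics by omega
expected <- omega_toy %*% t(theta_toy)
print(round(expected, 3))
```

Each row of `expected` is again a probability distribution over terms: the document is modelled as a blend of the topics, weighted by its row of \(\omega\).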

Let’s sort the matrix \(\theta\) with decreasing order of total term probability and check the few top terms, using \(\theta\):-

a0 = apply(simfit$theta, 1, sum); 
a01 = order(a0, decreasing = TRUE)
simfit$theta[a01[1:10],]

##                   topic
## phrase                        1            2            3            4
##   fcc              2.743454e-06 3.339662e-06 2.983672e-06 3.102463e-06
##   health-care      3.756629e-06 4.644913e-06 3.772714e-06 5.612564e-03
##   class            5.098074e-06 5.140999e-06 5.249932e-03 5.318606e-06
##   advertising      2.393756e-06 2.651254e-06 5.121660e-03 2.916539e-06
##   users            3.199533e-06 3.787551e-06 4.799522e-03 3.397154e-06
##   wireless         2.665658e-06 2.850212e-06 3.072653e-06 4.089448e-06
##   search           2.252650e-06 2.457995e-06 4.472769e-03 2.626506e-06
##   devices          7.563963e-05 3.034744e-05 2.898831e-03 1.084804e-04
##   hardware-systems 3.406118e-06 5.100126e-06 4.046231e-03 4.084160e-06
##   subscribers      2.655436e-06 3.152296e-06 1.271372e-05 3.312795e-06
##                   topic
## phrase                        5
##   fcc              6.277246e-03
##   health-care      4.298093e-06
##   class            7.229513e-06
##   advertising      3.948578e-06
##   users            3.783304e-06
##   wireless         4.593332e-03
##   search           2.740956e-06
##   devices          9.675240e-04
##   hardware-systems 3.705085e-06
##   subscribers      4.025614e-03

Here we can see that advertising, devices, hardware-systems and search have a much higher probability under topic 3 than under the other topics. Similarly, we can look at the \(\omega\) matrix for documents.

Let’s list a few rows of \(\omega\) in the same way. Because every row of \(\omega\) sums to 1, sorting by row total does not rank the documents in any meaningful way; the call simply displays ten documents:

a0 = apply(simfit$omega, 1, sum); 
a01 = order(a0, decreasing = TRUE)
simfit$omega[a01[1:10],]

##         topic
## document          1         2         3         4         5
##       3  0.38413309 0.1541613 0.1616716 0.1530563 0.1469777
##       13 0.18244784 0.2534216 0.1734561 0.2082247 0.1824497
##       42 0.12622406 0.1148924 0.1058127 0.5226805 0.1303904
##       68 0.13575288 0.1530699 0.1544839 0.1558510 0.4008423
##       81 0.16877185 0.2042642 0.3093618 0.1686324 0.1489698
##       83 0.15262526 0.2011450 0.1645768 0.1501069 0.3315460
##       2  0.36905331 0.1491659 0.1499737 0.1509191 0.1808880
##       4  0.09968835 0.1000903 0.1026223 0.5928737 0.1047255
##       5  0.13326192 0.1663668 0.1533591 0.1641318 0.3828804
##       6  0.10513169 0.1093335 0.1193681 0.1258570 0.5403097

We can see that document 3 loads most heavily on topic 1, documents 42 and 4 on topic 4, and documents 68, 5 and 6 on topic 5, whereas document 13 is spread fairly evenly across topics, with topic 2 only slightly ahead.
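
Reading the dominant topic off each row can be automated. The sketch below copies the first five rows of the output above, rounded to three decimals; the object names are ours. On the fitted model the same two calls could be applied to simfit$omega.

```r
# First five rows of the omega output above, rounded to 3 decimals
omega_top <- matrix(c(0.384, 0.154, 0.162, 0.153, 0.147,
                      0.182, 0.253, 0.173, 0.208, 0.182,
                      0.126, 0.115, 0.106, 0.523, 0.130,
                      0.136, 0.153, 0.154, 0.156, 0.401,
                      0.169, 0.204, 0.309, 0.169, 0.149),
                    ncol = 5, byrow = TRUE,
                    dimnames = list(c("3", "13", "42", "68", "81"), 1:5))

dominant <- apply(omega_top, 1, which.max)   # topic with the largest weight per document
strength <- apply(omega_top, 1, max)         # size of that largest weight

print(data.frame(document = rownames(omega_top), dominant, strength))
```

The `strength` column shows how decisive the assignment is: document 42 puts more than half its weight on one topic, while document 13 barely favours its top topic.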

Lift Calculation

Some terms have high frequency, others have low frequency. We want to ensure that term frequency does not unduly influence topic weights. So we normalize term frequency in a metric called ‘lift’.

The lift of a term is its topic membership probability normalized by the term’s occurrence probability. If the lift of a term for a topic is high, the term is useful in constructing that topic.
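
In symbols, using notation that is ours rather than the package’s (\(x_{dt}\) for the DTM weight of term \(t\) in document \(d\), and \(p_t\) for the term’s marginal share of the total weight), the lift of term \(t\) for topic \(k\) can be written as:

\[
\mathrm{lift}_{tk} = \frac{\theta_{tk}}{p_t},
\qquad
p_t = \frac{\sum_{d=1}^{D} x_{dt}}{\sum_{d=1}^{D} \sum_{t'=1}^{T} x_{dt'}}
\]

A lift above 1 means the term is more probable under topic \(k\) than it is in the corpus as a whole.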

Since the topics function doesn’t return a lift matrix for terms, we can write a simple function to calculate the lift of each term.

Depending on the number of terms in the DocumentTermMatrix, the lift calculation may take some time. It should complete in 2-3 minutes.

t = Sys.time()
theta = simfit$theta
lift = theta*0;  sum1 = sum(dtm2inv)
for (i in 1:nrow(theta)){  
  for (j in 1:ncol(theta)){
    ptermtopic = 0; pterm = 0;
    ptermtopic = theta[i, j]     # term i's probability of topic j membership
    pterm = sum(dtm2inv[,i])/sum1   # marginal probability of term i's occurrence in corpus
    if (is.finite(ptermtopic/pterm)) lift[i, j] = ptermtopic/pterm else lift[i, j] = 0
          # so, lift is topic membership probability normalized by occurrence probability
    }
}
Sys.time() - t # Total time for calculating lift

## Time difference of 1.611603 mins
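
The same quantity can also be computed without an explicit double loop. The following is a sketch on invented toy inputs (`w`, `theta_toy` and their values are hypothetical), not a drop-in replacement that we have timed on the real data; it relies on R dividing a matrix by a vector row-wise when the vector length equals the number of rows.

```r
# Hypothetical toy inputs: weights for 3 documents x 4 terms, theta for 4 terms x 2 topics
w <- matrix(c(0.2, 0.0, 0.1, 0,
              0.0, 0.3, 0.1, 0,
              0.1, 0.1, 0.0, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(NULL, c("wafer", "fcc", "users", "ability")))
theta_toy <- matrix(c(0.6, 0.1,
                      0.1, 0.6,
                      0.2, 0.2,
                      0.1, 0.1),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(colnames(w), c("1", "2")))

pterm <- colSums(w) / sum(w)          # marginal occurrence share of each term
lift_vec <- theta_toy / pterm         # entry [i, j] is divided by pterm[i]
lift_vec[!is.finite(lift_vec)] <- 0   # terms with zero total weight get lift 0

print(round(lift_vec, 3))
```

With a DocumentTermMatrix, `w` would first have to be converted with as.matrix, which may be memory-hungry for a large vocabulary.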

Let’s print the lift for the first 10 terms. [Remember, lift has the same dimension as the \(\theta\) matrix.]

lift[1:10,]

##                     topic
## phrase                        1          2          3          4
##   abilities          0.23053372 4.09557994 0.34976772 0.26842036
##   ability            0.00000000 0.00000000 0.00000000 0.00000000
##   abroad             0.28891695 0.09371211 1.42461880 2.83579750
##   absence            0.40795595 0.09203087 3.80556679 0.08819723
##   absolute           0.17611796 0.20506073 3.87752985 0.11490515
##   absolute-assurance 1.25305125 3.48477076 0.07069907 0.10644254
##   absorb             0.23876220 4.36888529 0.10287237 0.12652026
##   accelerate         0.09342606 0.15721273 0.19205275 1.87002620
##   accelerated        0.20597130 0.19567025 0.04476654 1.32107880
##   acceleration       0.12320942 0.30080173 0.04666774 0.05741015
##                     topic
## phrase                        5
##   abilities          0.07177219
##   ability            0.00000000
##   abroad             0.39614532
##   absence            0.63898179
##   absolute           0.73513678
##   absolute-assurance 0.07441881
##   absorb             0.24741516
##   accelerate         2.84132945
##   accelerated        3.38866488
##   acceleration       4.67356471

Word clouds

Now we have lift and theta for each term and each topic. We can plot a wordcloud for each topic in which terms are selected if their lift is above 1 and sized in proportion to their term probability. This wordcloud gives us an idea of the latent topic. Let’s plot the top 100 terms in each topic; interpreting them should reveal the theme that binds them together.

for (i in 1:K){       # For each topic 
  a0 = which(lift[,i] > 1) # terms with lift greater than 1 for topic i
  freq = theta[a0,i] # theta for terms with lift greater than 1
  freq = sort(freq,decreasing = T) # terms with higher probabilities for topic i
  
  # Sometimes fewer than 100 terms have lift above 1, so cap n at the number available
  n = min(100, length(freq))
  top_word = as.matrix(freq[1:n])
  
  # Plot wordcloud
  wordcloud(rownames(top_word), top_word,  scale=c(4,0.5), 1,
            random.order=FALSE, random.color=FALSE, 
            colors=brewer.pal(8, "Dark2"), use.r.layout = FALSE)
  mtext(paste("Latent Topic",i), side = 3, line = 2, cex=2)
}

Co-Occurrence Graphs

From these wordclouds we can label and interpret topics. To get a clearer picture of the topics, let’s plot a co-occurrence graph of the top 20 terms. As with the topic wordclouds, we first find the top terms for a topic and then construct a co-occurrence matrix. Once the co-occurrence matrix is constructed, we can plot it using the graph.adjacency function from the igraph package. Note that for better readability we censor this matrix so that each term keeps only its strongest edges. We also prefer an undirected graph to a directed one, because a directed graph is harder to interpret.

for (i in 1:K){       # For each topic 
  a0 = which(lift[,i] > 1) # terms with lift greater than 1 for topic i
  freq = theta[a0,i] # theta for terms with lift greater than 1
  freq = sort(freq,decreasing = T) # terms with higher probabilities for topic i
  
  # Sometimes fewer than 20 terms have lift above 1, so cap n at the number available
  n = min(20, length(freq))
  top_word = as.matrix(freq[1:n])
  
  # now for the top 20 words, take the matching columns of the Document Term Matrix
  mat  = dtm2inv[,match(row.names(top_word),colnames(dtm2inv))]
  
  mat = as.matrix(mat)
  cmat  = t(mat) %*% (mat)
  diag(cmat) = 0
  
  # Keep only each term's strongest connections (values at or above its 3rd-largest weight)
    for (p in 1:nrow(cmat)){
      vec = cmat[p,]
      cutoff = sort(vec, decreasing = T)[3]
      cmat[p,][cmat[p,] < cutoff] = 0
    }
  
  cmat[cmat <  quantile(cmat,.80)] = 0
  graph <- graph.adjacency(cmat, mode = "undirected",weighted=T)

  plot(graph,       #the graph to be plotted
       layout=layout.fruchterman.reingold,  # the layout method. 
       vertex.frame.color='blue',       #the color of the border of the dots 
       vertex.label.color='black',      #the color of the name labels
       vertex.label.font=1,         #the font of the name labels
       vertex.size = .00001,   # Dots size
       vertex.label.cex=1.3)
  mtext(paste("Topic",i), side = 3, line = 2, cex=2)
}
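
The matrix steps inside this loop can be followed on a toy example. The sketch below uses an invented 4 x 4 matrix `m` and keeps two links per term by cutting at the second-largest value, whereas the loop above takes the third-largest value as its cutoff; plotting is left out.

```r
# Hypothetical toy weights: 4 documents x 4 terms
m <- matrix(c(1, 1, 0, 1,
              1, 0, 1, 1,
              0, 1, 1, 0,
              1, 1, 0, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(NULL, c("wafer", "foundries", "fcc", "spectrum")))

co <- crossprod(m)   # same as t(m) %*% m: term-by-term co-occurrence
diag(co) <- 0        # ignore a term's co-occurrence with itself

# For each term keep only its 2 strongest links (ties at the cutoff are kept)
for (p in seq_len(nrow(co))) {
  cutoff <- sort(co[p, ], decreasing = TRUE)[2]
  co[p, co[p, ] < cutoff] <- 0
}

print(co)
```

Because pruning is done row by row, the result need not be symmetric: here fcc keeps links to terms that did not keep fcc in return.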

Topic Proportion

Now we have the lift matrix and the DocumentTermMatrix, so we can create a weighting scheme for each document and each topic that gives the proportion of a topic in a document. Here we first define a function and then call it to calculate the topic proportions in the documents.

eta = function(mat, dtm) {
  
  mat1 = mat/mean(mat);  
  terms1 = rownames(mat1);
  eta.mat = matrix(0, 1, ncol(mat1))
  
  for (i in 1:nrow(dtm)){
    a11 = as.data.frame(matrix(dtm[i,])); 
    rownames(a11) = colnames(dtm)
    a12 = as.matrix(a11[(a11>0),]);  
    rownames(a12) = rownames(a11)[(a11>0)]; 
    rownames(a12)[1:4]
    a13 = intersect(terms1, rownames(a12)); 
    a13[1:15];  length(a13)
    a14a = match(a13, terms1);      # positions of matching terms in mat1 matrix
    a14b = match(a13, rownames(a12))        
    a15 = mat1[a14a,]*matrix(rep(a12[a14b,], 
                                 ncol(mat1)), 
                             ncol = ncol(mat1))
    eta.mat = rbind(eta.mat, apply(a15, 2, mean))   
    rm(a11, a12, a13, a14a, a14b, a15)
  }
  eta.mat = eta.mat[2:nrow(eta.mat), ]  # remove top zeroes row
  row.names(eta.mat)=row.names(dtm)
  return(eta.mat)
}

twc = eta(lift, dtm2inv)
head(twc)

##              1            2            3            4            5
## 1 0.0010461553 0.0006164068 0.0005503467 0.0004586269 0.0008207239
## 2 0.0020247775 0.0006449805 0.0006241522 0.0006122275 0.0008940452
## 3 0.0018310012 0.0006682696 0.0006961478 0.0006245990 0.0006030765
## 4 0.0004329069 0.0004457177 0.0007396598 0.0138980023 0.0009317454
## 5 0.0003873071 0.0006090110 0.0005175799 0.0005839441 0.0016477411
## 6 0.0003863032 0.0004991276 0.0006890585 0.0008019815 0.0046988701
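
As we read the eta function, each score is the average, over the terms present in a document, of the term’s weight times its mean-scaled lift. Under that reading, and assuming the columns of the weight matrix are in the same order as the rows of the lift matrix, the calculation can be sketched with one matrix product. All inputs below are invented.

```r
# Hypothetical toy inputs: lift for 3 terms x 2 topics, weights for 2 documents x 3 terms
lift_toy <- matrix(c(2.0, 0.5,
                     0.5, 2.5,
                     1.0, 1.0),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(c("wafer", "fcc", "users"), c("1", "2")))
w <- matrix(c(0.4, 0.0, 0.2,
              0.0, 0.5, 0.1),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("1", "2"), rownames(lift_toy)))

scaled  <- lift_toy / mean(lift_toy)   # rescale lift by its overall mean
n_terms <- rowSums(w > 0)              # number of terms present in each document
score   <- (w %*% scaled) / n_terms    # average of weight x lift over the present terms

print(round(score, 3))
```

These scores do not sum to 1 across topics; dividing each row by its total (score / rowSums(score)) would turn them into proportions if that is preferred.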

Now that we have the topic proportions for each document, we can find the top documents loading on a topic and read them for a better interpretation of the topics.

Here we first define a function that sorts the twc matrix in decreasing order for each topic and picks the names of the top n documents. We then call it with n = 10 and print the company names for each topic.

eta.file.name = function(mat,calib,n) {
  s = list()                   # Blank List
  for (i in  1: ncol(mat))     # For each topic
  {
    read_doc = mat[order(mat[,i], decreasing= T),]  # Sort document prop matrix (twc)
    read_names = row.names(read_doc[1:n,])          # document index for first n documents
    s[[i]] = calib[as.numeric(read_names),1]     # Store first n companies name in list  
  }
  return(s)
}

temp1 = eta.file.name(twc,textdata,10)

for (i in 1:length(temp1)){
  print(paste('Companies loading heavily on topic',i,'are'))
  print(temp1[[i]])
  print('--------------------------')
}

## [1] "Companies loading heavily on topic 1 are"
##  [1] "TEXAS INSTRUMENTS INC"     "INTEL CORP"               
##  [3] "ACUITY BRANDS INC"         "XILINX INC"               
##  [5] "BENCHMARK ELECTRONICS INC" "LSI CORP"                 
##  [7] "MAXIM INTEGRATED PRODUCTS" "ANALOG DEVICES"           
##  [9] "APPLIED MATERIALS INC"     "PLEXUS CORP"              
## [1] "--------------------------"
## [1] "Companies loading heavily on topic 2 are"
##  [1] "IRON MOUNTAIN INC"           "LEXMARK INTL INC  -CL A"    
##  [3] "HARRIS CORP"                 "DELL INC"                   
##  [5] "ACUITY BRANDS INC"           "EMC CORP/MA"                
##  [7] "DIEBOLD INC"                 "CIENA CORP"                 
##  [9] "ITRON INC"                   "INTL BUSINESS MACHINES CORP"
## [1] "--------------------------"
## [1] "Companies loading heavily on topic 3 are"
##  [1] "AOL INC"                 "IAC/INTERACTIVECORP"    
##  [3] "MICROSOFT CORP"          "INSIGHT ENTERPRISES INC"
##  [5] "ELECTRONIC ARTS INC"     "YAHOO INC"              
##  [7] "GOOGLE INC"              "SYSTEMAX INC"           
##  [9] "INTL GAME TECHNOLOGY"    "FACEBOOK INC"           
## [1] "--------------------------"
## [1] "Companies loading heavily on topic 4 are"
##  [1] "AUTOMATIC DATA PROCESSING"    "AMPHENOL CORP"               
##  [3] "UNISYS CORP"                  "INTL BUSINESS MACHINES CORP" 
##  [5] "BROADRIDGE FINANCIAL SOLUTNS" "VERIZON COMMUNICATIONS INC"  
##  [7] "COMPUTER SCIENCES CORP"       "DST SYSTEMS INC"             
##  [9] "XEROX CORP"                   "FIDELITY NATIONAL INFO SVCS" 
## [1] "--------------------------"
## [1] "Companies loading heavily on topic 5 are"
##  [1] "VERIZON COMMUNICATIONS INC"   "WINDSTREAM HOLDINGS INC"     
##  [3] "FRONTIER COMMUNICATIONS CORP" "SPRINT CORP"                 
##  [5] "LEAP WIRELESS INTL INC"       "CENTURYLINK INC"             
##  [7] "AMPHENOL CORP"                "TELEPHONE & DATA SYSTEMS INC"
##  [9] "ECHOSTAR CORP"                "DST SYSTEMS INC"             
## [1] "--------------------------"

Companies with similar areas of interest talk about similar topics. To help interpret those interests and issues, we can find the terms associated with selected keywords. The line of code below displays the associations.

#findAssocs(dtm2inv, c("semiconductor","licensees","raw-material","sales","fcc", "google","advertising","health-care","contract"), corlimit=0.60)

Similarly, we can find the top text documents. Since these documents are very large, we are not printing them here. You can uncomment the code below to print and read the documents as required.

eta.file = function(mat,calib,n) {
  s = list()                   # Blank List
  for (i in  1: ncol(mat))     # For each topic
  {
    read_doc = mat[order(mat[,i], decreasing= T),]  # Sort document prop matrix (twc)
    read_names = row.names(read_doc[1:n,])          # document index for first n documents
    s[[i]] = calib[as.numeric(read_names),2]     # Store first n documents in list  
  }
  return(s)
}

temp2 = eta.file(twc,textdata,5)

# for (i in 1:length(temp2)){
#   print(paste('Documents loading heavily on topic',i,'are'))
#   print(temp2[[i]])
#   print('--------------------------')
# }

Group A2029 @ CBA Section B @ ISB

20 March 2016