"Data Crunchers Consulting"

Prepared By: Vaneesh Narayanan (71610099), Subba Reddy Yeruva (71610085), Surajit Dhar (71610091)

INTRODUCTION

As more and more data is digitized, it becomes increasingly challenging to sift through it for information. One example, and the focus of this paper, is evaluating the different areas of risk faced by companies.

To do so, we refer to the annual reports that companies file with the U.S. Securities and Exchange Commission (SEC). A Form 10-K is an annual report required by the SEC that gives a comprehensive summary of a company’s financial performance [1]. The 10-K includes information such as company history, organizational structure, executive compensation, equity, subsidiaries, and audited financial statements, among other information.

OBJECTIVE

Apart from the usual information such as stock volatility and seasonality, companies publish in their Form 10-K a discussion of the specific risk factors they face.

These serve as warnings to potential investors on the state of the companies [3]. Our objective is to categorize the risks faced by companies, which can serve as valuable input for adequately warning current and potential investors.

SIGNIFICANCE OF THE ANALYSIS

Much of this disclosure is standard: stocks are volatile and some businesses are seasonal. The value lies in anything out of the ordinary, such as “our top 15 customers account for approximately 80% of our net sales,” or “numerous product-liability suits have been filed against us” – real warnings in the latest 10-K of Matrixx Initiatives, a maker of over-the-counter drugs. Or consider the language in the 10-K filed by Live Nation Entertainment (symbol LYV), the concert promoter and owner of Ticketmaster.com: “We have a large amount of debt and lease obligations that could restrict our operations and impair our financial condition.”

STRUCTURE OF FORM 10-K

A Form 10-K is organized into standard items, including Item 1 (Business), Item 1A (Risk Factors), Item 3 (Legal Proceedings), Item 7 (Management's Discussion and Analysis), and Item 8 (Financial Statements and Supplementary Data). Our analysis focuses on Item 1A, the Risk Factors section.

ANALYSIS

As per David Blei, topic models are algorithms for discovering the main themes that pervade a large and otherwise unstructured collection of documents; topic models can then organize the collection according to the discovered themes [2].

Topic models are a neat way to explore or understand any document collection. We begin by clearing the workspace and loading the required packages, like so:

rm(list=ls()) # Clear the workspace

library("tm")
library("wordcloud")
library("maptpx")
library("igraph")
library(maptpx)
library(LDAvis)
library(lda)

For brevity, we downloaded the Form 10-K filings in advance and extracted the Risk Factors section for each company.

textdata = readRDS("RF.Technology.Rds")   # pre-extracted Risk Factors text (technology sector companies)

We will compute both term-frequency (TF) and TF-IDF weighted document-term matrices for evaluation. First, we assign a document ID to each filing and build a custom stop-word list.

Doc.id = seq(1:length(textdata$Company_Name))   # assign a document ID to each document
companyRDF = data.frame(Doc.id, textdata)       # data frame of text documents with document IDs

stpw = c("item.","1a","risk","factors","may","and","our","the","that","for","are","also","u","able","use","will","can","s")   # custom stopwords for this corpus
stpw1 = stopwords('english')                    # tm package stop word list
comn  = unique(c(stpw, stpw1))                  # union of the two lists
stopwords = unique(c(gsub("'","",comn),comn))   # final stop word list, with and without apostrophes

#############################################################
#                        Text Cleaning                      #
#############################################################

text.clean = function(x)                          # text data
{
  x  =  gsub("<.*?>", "", x)                  # regex for removing HTML tags
  x  =  gsub("[^[:alnum:]///' ]", " ", x)     # keep only alpha numeric 
  x  =  iconv(x, "latin1", "ASCII", sub="")   # Keep only ASCII characters
  x  =  tolower(x)                          # convert to lower case characters
  x  =  removePunctuation(x)                # removing punctuation marks
  x  =  removeNumbers(x)                    # removing numbers
  x  =  stripWhitespace(x)                  # removing white space
  x  =  gsub("^\\s+|\\s+$", "", x)          # remove leading and trailing white space
  x  =  gsub("'", "", x)                    # remove apostrophes
  x  =  gsub("[[:cntrl:]]", " ", x)         # replace control characters with space
  x  =  gsub("^[[:space:]]+", "", x)        # remove whitespace at beginning of documents
  x  =  gsub("[[:space:]]+$", "", x)        # remove whitespace at end of documents
  return(x)
}

###########################################################
# Customize Document Term Matrix
###########################################################
custom.dtm  = function(x1,                               # Text Corpus
                scheme)                           # tf or tfidf
{
  
tdm = TermDocumentMatrix(x1)

a1 = apply(tdm, 1, sum)
a2 =((a1 >= 2))
tdm.new = tdm[a2, ]

# remove blank documents (i.e. columns with zero sums)
a0 = NULL; 
for (i1 in 1:ncol(tdm.new)){ if (sum(tdm.new[, i1]) == 0) {a0 = c(a0, i1)} }
length(a0)    # no. of empty docs in the corpus
if (length(a0) >0) { tdm.new1 = tdm.new[, -a0]} else {tdm.new1 = tdm.new};

dim(tdm.new1)    # reduced tdm
if (scheme == "tfidf") {
x2mat = t(weightTfIdf(tdm.new1))
}
else {x2mat = t((tdm.new1))}
return(x2mat)
}


##############################################################

companyRDF$RF <-  text.clean(companyRDF$RF)      # basic HTML Cleaning etc
companyRDF$RF <-  removeWords(companyRDF$RF,stopwords)            # removing stopwords created above

The term frequency of term t in document d is defined as the number of times that t occurs in d [7], whereas the inverse document frequency estimates the rarity of a term in the whole document collection. (If a term occurs in all the documents of the collection, its IDF is zero.)
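In symbols, if \(N\) is the number of documents in the collection and \(df(t)\) is the number of documents containing term \(t\), then \( \text{idf}(t) = \log_2 \frac{N}{df(t)} \) and \( \text{tf-idf}(t,d) = \text{tf}(t,d) \times \text{idf}(t) \). The weightTfIdf() function from the tm package, used below, additionally normalizes \(\text{tf}(t,d)\) by document length by default.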

# tokenize on space and output as a list:
doc.list <- strsplit(companyRDF$RF, "[[:space:]]+")

# compute the table of terms:
term.table <- table(unlist(doc.list))
term.table <- sort(term.table, decreasing = TRUE)

For our analysis, we shall use TF-IDF, as this helps normalize the effect of terms occurring across all documents by giving them a lower weight.

########################################################
#             Create Document Term Matrix              #
########################################################

x1 = Corpus(VectorSource(companyRDF$RF))   # create the corpus
#x1 = n.gram(x1,"bi",2)                    # encode bi-grams with frequency of at least 2 as uni-grams

dtm1 = custom.dtm(x1,"tf")                 # term frequency scheme
dtm2 = custom.dtm(x1,"tfidf")              # term frequency - inverse document frequency scheme

freq1 = (sort(apply(dtm1,2,sum), decreasing = T))      # total term frequency across the corpus

su <- dtm1 * dtm2                          # element-wise product: TF weighted by TF-IDF

term.table = (sort(apply(su,2,sum), decreasing = T))   # rank terms by the combined TF x TF-IDF score

We start by building the list of unique terms (the vocabulary) in the corpus, and then map each document's tokens onto that vocabulary.

vocab <- names(term.table)   # vocabulary, ordered by the combined score above

# convert each tokenized document to the index/count format expected by the lda package
get.terms <- function(x) {
  index <- match(x, vocab)                 # map tokens to vocabulary positions
  index <- index[!is.na(index)]            # drop tokens not in the vocabulary
  rbind(as.integer(index - 1), as.integer(rep(1, length(index))))  # 0-based indices with unit counts
}

documents <- lapply(doc.list, get.terms)

We will record the statistics related to the dataset, like so.

D <- length(documents)                                    # number of documents (85)
W <- length(vocab)                                        # number of terms in the vocabulary (6662)
doc.length <- sapply(documents, function(x) sum(x[2, ]))  # number of tokens per document
N <- sum(doc.length)                                      # total number of tokens in the data (385718)
term.frequency <- as.integer(term.table)                  # per-term weight from the combined score above, coerced to integer

There are D = 85 documents and W = 6662 unique terms in the corpus, and we need to identify K topics.
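K must be chosen by the analyst. If we were unsure, maptpx's topics() function accepts a vector of candidate values and selects one by comparing Bayes factors; a minimal sketch (variable names ours), using the term-count matrix dtm1 built earlier:

# A possible way to choose K: let maptpx compare several candidate values
# and select one via Bayes factors (this can take a while to run).
k.search = topics(dtm1, K = 2:10, verb = 1)
k.search$K            # the number of topics selected

In this analysis we simply proceed with K = 6.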

The Topic model gives us mainly two outputs:

One, a \(\theta\) matrix of term probabilities, which tells us, for each term, the probability that the term belongs to each topic; its dimension is W x K.

Two, an \(\omega\) document-composition matrix, which gives the probability distribution of topic proportions within each document; its dimension is D x K.
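Together, these two matrices factorize each document's term distribution: for document \(d\) and term \(w\), \( p(w \mid d) = \sum_{k=1}^{K} \omega_{d,k}\,\theta_{w,k} \).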

We now set up a topic model with 6 topics. The Dirichlet hyperparameters for the topic proportions (\(\alpha\)) and the topic-term multinomials (\(\eta\)) are both set to 0.02.

  K <- 6          # number of topics
  G <- 1000       # number of Gibbs sampling iterations
  alpha <- 0.02   # Dirichlet prior on document-topic proportions
  eta <- 0.02     # Dirichlet prior on topic-term multinomials

  # Fit the model:
  set.seed(357)

  fit <- lda.collapsed.gibbs.sampler(documents = documents, K = K,
                                     vocab = vocab,
                                     num.iterations = G, alpha = alpha,
                                     eta = eta, initial = NULL, burnin = 0,
                                     compute.log.likelihood = TRUE)

Next, we visualize the fitted model using LDAvis.

  theta <- t(apply(fit$document_sums + alpha, 2, function(x) x/sum(x)))  # D x K document-topic proportions
  phi <- t(apply(t(fit$topics) + eta, 2, function(x) x/sum(x)))          # K x W topic-term distributions

We’ve already computed the number of tokens per document and the frequency of the terms across the entire corpus. We save these, along with \(\phi\), \(\theta\), and vocab, in a list as the data object RiskAnalysis, in the format expected by the LDAvis package.

  RiskAnalysis <- list(phi = phi,
                       theta = theta,
                       doc.length = doc.length,
                       vocab = vocab,
                       term.frequency = term.frequency)

Now we’re ready to call the createJSON() function in LDAvis. This function returns a character string representing a JSON object used to populate the visualization. createJSON() computes topic frequencies and inter-topic distances, and projects the topics onto a two-dimensional plane to represent their similarity to each other.

  json <- createJSON(phi = RiskAnalysis$phi, 
                     theta = RiskAnalysis$theta, 
                     doc.length = RiskAnalysis$doc.length, 
                     vocab = RiskAnalysis$vocab, 
                     term.frequency = RiskAnalysis$term.frequency)

The serVis() function can take json and serve the result in a variety of ways. We have commented out the call below because it launches an interactive visualization.

  #serVis(json)
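If an interactive session is not available, a minimal sketch of writing the same visualization to a standalone folder instead (the directory name here is our own choice):

  # Write the visualization as static HTML/JS files that can be opened or hosted later
  serVis(json, out.dir = "RiskAnalysis_vis", open.browser = FALSE)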

Here is how it looks for the 6 topics that we chose

[LDAvis screenshots: the overall inter-topic distance map, followed by detailed views of Topic-1 through Topic-6]

We can see that Topic-2 and Topic-3 overlap, and this is also evident from the terms within them. Looking at them closely, however, Topic-3 leans towards the manufacturing industry, with terms about supply management and demand and supply, whereas Topic-2 is more about software products, operations, revenue, and services.

We now fit a maptpx topic model with the same choice of K = 6.

  options(width = 500)
  
K = 6 # Choose number of topics in the model
simfit = topics(dtm2,  K = K, verb = 2) # Fit the K topic model
## 
## Estimating on a 85 document collection.
## Fitting the 6 topic model.
## log posterior increase: 0.8, 1, 0.6, 0.1, 0, done. (L = -709)
summary(simfit, nwrd = 12)  # Summary of simfit model
## 
## Top 12 phrases by topic-over-null term lift (and usage %):
## 
## [1] 'bull', 'corning', 'rsquo', 'glass', 'teledyne', 'middot', 'vpg', 'ndash', 'vishay', 'dalsa', 'dow', 'lecroy' (17) 
## [2] 'aol', 'advertisers', 'yahoo', 'ads', 'advertising', 'tapp', 'reit', 'apps', 'engagement', 'zuckerberg', 'ibx', 'titles' (16.9) 
## [3] 'frontier', 'households', 'mbps', 'switched', 'escrow', 'unserved', 'windstream', 'rural', 'collective', 'territories', 'bargaining', 'tower' (16.8) 
## [4] 'flash', 'nand', 'captive', 'ssd', 'solar', 'modules', 'memory', 'bics', 'reram', 'module', 'applied', 'wafers' (16.8) 
## [5] 'merchant', 'groupons', 'companys', 'peo', 'nhs', 'clients', 'iaccn', 'motivated', 'ibms', 'client', 'csc', 'lorenzo' (16.8) 
## [6] 'leap', 'cricket', 'preservation', 'blurred', 'dated', 'deploys', 'mvno', 'subsidized', 'waiting', 'rollout', 'leaps', 'dividing' (15.8) 
## 
## Dispersion = -0.01

Let’s view the term-probability matrix \(\theta\), sort it in decreasing order of total term probability, and check the top few terms:

options(width = 500)
a0 = apply(simfit$theta, 1, sum)      # total probability mass of each term across topics
a01 = order(a0, decreasing = TRUE)    # order terms by that total
simfit$theta[a01[1:10],]              # top 10 terms
##            topic
## phrase                 1            2            3            4            5            6
##   tds       2.648959e-06 2.586869e-06 2.805227e-06 2.680430e-06 2.702986e-06 3.510401e-02
##   brocades  2.842265e-06 2.669269e-06 2.823106e-06 2.920408e-06 2.799614e-06 2.506861e-02
##   clients   3.646912e-06 3.546243e-06 3.928365e-06 3.578786e-06 1.898607e-02 3.969126e-06
##   companys  3.492683e-06 3.201781e-06 3.634969e-06 3.471822e-06 1.816372e-02 3.747852e-06
##   sprints   2.549403e-06 2.484293e-06 2.629347e-06 2.561769e-06 2.579702e-06 1.698829e-02
##   brocade   2.856517e-06 2.680381e-06 2.839519e-06 2.936274e-06 2.811306e-06 1.651110e-02
##   solar     3.302091e-06 3.124243e-06 3.361212e-06 1.292429e-02 3.270826e-06 3.614037e-06
##   sprint    2.578740e-06 2.513213e-06 2.669652e-06 2.595447e-06 2.609471e-06 1.278997e-02
##   reit      3.553825e-06 1.196685e-02 3.855616e-06 3.501886e-06 3.483736e-06 3.886507e-06
##   clearwire 2.549970e-06 2.484945e-06 2.630211e-06 2.562457e-06 2.580352e-06 1.193481e-02

Here you can see, for example, that 'tds', 'brocades', and 'sprint' have a much higher probability under topic 6, 'clients' and 'companys' under topic 5, and 'solar' under topic 4.

We also examine the \(\omega\) matrix, which gives the probability of each document's association with the topics.

options(width = 500)
simfit$omega[1:10,]
##         topic
## document          1          2         3          4         5         6
##       1  0.13290480 0.13105774 0.1318767 0.35729726 0.1209075 0.1259561
##       2  0.23640159 0.13706762 0.1484124 0.21041974 0.1342693 0.1334293
##       3  0.13676833 0.12301388 0.1227510 0.37290276 0.1251001 0.1194639
##       4  0.09920569 0.09944122 0.1006772 0.09860462 0.5015284 0.1005428
##       5  0.13465553 0.14035768 0.2964859 0.13016315 0.1426592 0.1556786
##       6  0.09969202 0.10480960 0.4542832 0.10026436 0.1099848 0.1309660
##       7  0.11668769 0.10861933 0.1301019 0.11348415 0.4139718 0.1171352
##       8  0.38743792 0.12338647 0.1222238 0.12780836 0.1241574 0.1149860
##       9  0.19793670 0.13959183 0.2197639 0.13766412 0.1675246 0.1375189
##       10 0.18527824 0.14644241 0.2087677 0.17083618 0.1542025 0.1344730

We can say that documents 1 and 3 load heavily on topic 4, whereas document 7 loads heavily on topic 5. Document 2 is a mix of topics 1 and 4.
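As a quick check, the dominant topic of each document can be read directly off the \(\omega\) matrix:

dominant.topic = apply(simfit$omega, 1, which.max)  # topic with the largest share in each document
head(dominant.topic, 10)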

Some terms have high frequency, others low. We want to ensure that raw term frequency does not unduly influence the topic weights, so we normalize by term frequency in a metric called ‘lift’.

The lift of a term is its topic-membership probability normalized by its occurrence probability. If a term's lift for a topic is high, that term is useful in constructing that topic.
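In the notation of the code below, for term \(t\) and topic \(k\), \( \text{lift}(t,k) = \theta_{t,k} / p(t) \), where \( p(t) \) is the share of all tokens in the corpus accounted for by term \(t\).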

Since the topics() function does not return a lift matrix for terms, we compute the lift of each term with a simple loop.

theta = simfit$theta
lift = theta*0;  sum1 = sum(dtm1)
for (i in 1:nrow(theta)){  
  for (j in 1:ncol(theta)){
    ptermtopic = 0; pterm = 0;
    ptermtopic = theta[i, j]     # term i's probability of topic j membership
    pterm = sum(dtm1[,i])/sum1   # marginal probability of term i's occurrence in corpus
    lift[i, j] = ptermtopic/pterm   # so, lift is topic membership probability normalized by occurrence probability
      }
}

We generate a wordcloud for each of the six topics, as below.

options(width = 500)

for (i in 1:K){                            # for each topic
  a0 = which(lift[,i] > 1)                 # terms with lift greater than 1 for topic i
  freq = theta[a0,i]                       # theta for those terms
  freq = sort(freq, decreasing = T)        # terms with higher probabilities for topic i

  # Auto-correction: sometimes fewer than 100 terms in a topic have lift above 1
  n = ifelse(length(freq) >= 100, 100, length(freq))
  top_word = as.matrix(freq[1:n])

  # Plot wordcloud
  wordcloud(rownames(top_word), top_word, scale = c(4,0.5), 1,
            random.order = FALSE, random.color = FALSE,
            colors = brewer.pal(8, "Dark2"))
  mtext(paste("Latent Topic", i), side = 3, line = 2, cex = 2)
}

Let's look at the co-occurrence graphs to deepen our insights.

for (i in 1:K){                            # for each topic
  a0 = which(lift[,i] > 1)                 # terms with lift greater than 1 for topic i
  freq = theta[a0,i]                       # theta for those terms
  freq = sort(freq, decreasing = T)        # terms with higher probabilities for topic i

  # Auto-correction: sometimes fewer than 20 terms in a topic have lift above 1
  n = ifelse(length(freq) >= 20, 20, length(freq))
  top_word = as.matrix(freq[1:n])

  # document-term matrix restricted to these top terms
  mat  = dtm1[, match(row.names(top_word), colnames(dtm1))]

  mat = as.matrix(mat)
  cmat  = t(mat) %*% (mat)                 # term-term co-occurrence counts
  diag(cmat) = 0                           # drop self-co-occurrence

  # limit each term to its 2 strongest connections
  for (p in 1:nrow(cmat)){
    vec = cmat[p,]
    cutoff = sort(vec, decreasing = T)[2]
    cmat[p,][cmat[p,] < cutoff] = 0
  }

  #cmat[cmat <  quantile(cmat,.80)] = 0

  graph <- graph.adjacency(cmat, mode = "undirected", weighted = T)

  plot(graph,                                 # the graph to be plotted
       layout = layout.fruchterman.reingold,  # the layout method
       vertex.frame.color = 'blue',           # color of the border of the dots
       vertex.label.color = 'black',          # color of the name labels
       vertex.label.font = 1,                 # font of the name labels
       vertex.size = .00001,                  # dot size
       vertex.label.cex = 1.3)                # size of the name labels
  mtext(paste("Topic", i), side = 3, line = 2, cex = 2)
}

# Compute a document-by-topic score matrix: for each document, average the lift
# of the terms it contains, weighted by their term frequencies.
# (Note: this re-uses the name 'eta', earlier the Dirichlet hyperparameter.)
eta = function(mat, dtm) {

  mat1 = mat/mean(mat);  terms1 = rownames(mat1);
  eta.mat = matrix(0, 1, ncol(mat1))

  for (i in 1:nrow(dtm)){
    a11 = as.data.frame(matrix(dtm[i,]))             # term counts for document i
    rownames(a11) = colnames(dtm)
    a12 = as.matrix(a11[(a11>0),])                   # keep only terms present in the document
    rownames(a12) = rownames(a11)[(a11>0)]
    a13 = intersect(terms1, rownames(a12))           # terms in both the lift matrix and the document
    a14a = match(a13, terms1)                        # positions of matching terms in mat1
    a14b = match(a13, rownames(a12))                 # positions of matching terms in the document
    a15 = mat1[a14a,]*matrix(rep(a12[a14b,],
                                 ncol(mat1)),
                             ncol = ncol(mat1))      # lift weighted by term counts
    eta.mat = rbind(eta.mat, apply(a15, 2, mean))    # average over the document's terms
    rm(a11, a12, a13, a14a, a14b, a15)
  }
  eta.mat = eta.mat[2:nrow(eta.mat), ]               # remove the initial zero row
  row.names(eta.mat) = row.names(dtm)
  return(eta.mat)
}

twc = eta(lift, dtm1)   # document-by-topic scores
head(twc)
##           1         2         3          4         5         6
## 1 0.6708877 0.5636166 0.6048141 2.14092523 0.4778560 0.5151362
## 2 0.9794480 0.3089034 0.4155501 0.79548679 0.3118430 0.2827380
## 3 0.6988186 0.4577890 0.4774151 2.27652508 0.5186728 0.4234748
## 4 0.1010131 0.1082193 0.1356469 0.08532243 4.0677151 0.1303855
## 5 0.5268186 0.5656235 1.6551426 0.44472048 0.6520185 0.7061878
## 6 0.2490611 0.2992340 3.3020952 0.22513416 0.4191098 0.6642377

Below, we identify the companies that load most heavily on each of the topics we chose.

# For each topic, return the names of the n companies whose documents score highest
eta.file.name = function(mat, calib, n) {
  s = list()                                         # empty list
  for (i in 1:ncol(mat))                             # for each topic
  {
    read_doc = mat[order(mat[,i], decreasing = T),]  # sort the document score matrix (twc)
    read_names = row.names(read_doc[1:n,])           # document indices of the first n documents
    s[[i]] = calib[as.numeric(read_names),1]         # store the first n company names in the list
  }
  return(s)
}

temp1 = eta.file.name(twc,textdata,5)

for (i in 1:length(temp1)){
  print(paste('Companies loading heavily on topic',i,'are'))
  print(temp1[[i]])
  print('--------------------------')
}
## [1] "Companies loading heavily on topic 1 are"
## [1] "TELEDYNE TECHNOLOGIES INC"  "CORNING INC"                "VISHAY INTERTECHNOLOGY INC" "BENCHMARK ELECTRONICS INC"  "WESTERN DIGITAL CORP"      
## [1] "--------------------------"
## [1] "Companies loading heavily on topic 2 are"
## [1] "FACEBOOK INC"  "ECHOSTAR CORP" "YAHOO INC"     "AOL INC"       "GOOGLE INC"   
## [1] "--------------------------"
## [1] "Companies loading heavily on topic 3 are"
## [1] "FRONTIER COMMUNICATIONS CORP" "WINDSTREAM HOLDINGS INC"      "LEVEL 3 COMMUNICATIONS INC"   "MANTECH INTL CORP"            "CENTURYLINK INC"             
## [1] "--------------------------"
## [1] "Companies loading heavily on topic 4 are"
## [1] "FIRST SOLAR INC"        "SANDISK CORP"           "SUNPOWER CORP"          "APPLIED MATERIALS INC"  "ADVANCED MICRO DEVICES"
## [1] "--------------------------"
## [1] "Companies loading heavily on topic 5 are"
## [1] "AUTOMATIC DATA PROCESSING"   "DST SYSTEMS INC"             "COMPUTER SCIENCES CORP"      "CERNER CORP"                 "INTL BUSINESS MACHINES CORP"
## [1] "--------------------------"
## [1] "Companies loading heavily on topic 6 are"
## [1] "SPRINT CORP"                  "TELEPHONE & DATA SYSTEMS INC" "INTL GAME TECHNOLOGY"         "BROCADE COMMUNICATIONS SYS"   "LEAP WIRELESS INTL INC"      
## [1] "--------------------------"

CONCLUSION

Latent Topic 1

This maps to Topic #3 in the serVis output. This Latent Topic predominantly concerns risks related to the manufacturing of products and their demand-supply chain.

Latent Topic 2

This maps to Topic #5 in the serVis output. This Latent Topic is dominated by risks related to online and mobile advertising.

Latent Topic 3

This Latent Topic is dominated by risks related to costs associated with dividends and pensions. We can also see risks related to broadband and cable carriers.

Latent Topic 4

This maps to Topic #6 in the serVis output. This Latent Topic is dominated by risks related to financing and mergers in the solar power sector.

Latent Topic 5

This maps to Topic #4 in the serVis output. This Latent Topic relates to the health sector and concerns risks around the implementation of government regulations.

REFERENCES