Word Cloud for SEC Data

Antonio Rubiera
12/12/2019

Natural Language Processing of SEC Reports

The Securities and Exchange Commission (SEC) requires publicly listed U.S. companies to file a large number of reports. These reports contain financial data annotated with text of varying lengths. In this shiny app, we have collected a small sample of recent annotations from the financial reports of three companies with different types of operations and different styles of text annotation. Apple is included as an example of terse language, General Electric (GE) as an example of verbose text, and Walmart as an example of a large retailer.

The shiny app is located here:

https://rubiera.shinyapps.io/shiny/

Natural Language Processing of SEC Reports

The text is transformed using the tm_map function of the tm package, one of the NLP (Natural Language Processing) packages available in R. After turning the text into a VCorpus in tm, we:

  • Make all text lower case
  • Remove punctuation
  • Remove numbers
  • Remove common words such as “the” and “and,” referred to in NLP as “stopwords.” We also remove the word “company.”
  • Do not stem the text, because stemming, which is the process of reducing a word's various inflected forms to a common stem, would make the word clouds less readable.
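The steps above can be sketched in R roughly as follows. This is a minimal illustration rather than the app's actual code: the `sample_text` vector is made-up stand-in text, the use of `stopwords("english")` as the stopword list is an assumption, and the final `stripWhitespace` step is an extra cleanup that the list above does not mention.

```r
library(tm)

# Hypothetical stand-in for the SEC annotation text (not real filing data)
sample_text <- c(
  "The Company recorded a loss of 25 million in 2019.",
  "The loss was offset by a tax benefit, and the Company expects growth."
)

# Turn the character vector into a VCorpus, one document per element
corpus <- VCorpus(VectorSource(sample_text))

# Apply the transformations in the order listed above
corpus <- tm_map(corpus, content_transformer(tolower))  # lower case
corpus <- tm_map(corpus, removePunctuation)             # remove punctuation
corpus <- tm_map(corpus, removeNumbers)                 # remove numbers
# Assumption: English stopword list from tm, plus the extra word "company"
corpus <- tm_map(corpus, removeWords, c(stopwords("english"), "company"))
# Extra step (not in the list above): collapse the gaps left by removed words
corpus <- tm_map(corpus, stripWhitespace)

# No stemming step is applied, so words keep their readable surface forms
cleaned <- sapply(corpus, as.character)
print(cleaned)
```

Wrapping `tolower` in `content_transformer` keeps each document a proper tm text document, which later steps such as building a term-document matrix rely on.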

Word Cloud for Addenda Text Example from Apple

[Figure: word cloud of the Apple addenda text sample]

Word Cloud for Addenda Text Example from General Electric

[Figure: word cloud of the General Electric addenda text sample]

The word clouds can give us a quick feel for the language of a company. Word size reflects frequency: for example, the word “loss” appears larger than the word “benefit” in the General Electric cloud because it occurs more often in our text sample for that company.
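The link between word size and frequency can be checked directly by counting terms. The sketch below is illustrative only: `ge_text` is invented text rather than a real General Electric filing, and the optional plotting call assumes the separate wordcloud package, which this document does not name as the one used in the app.

```r
library(tm)

# Hypothetical stand-in text (not taken from an actual GE report)
ge_text <- c(
  "Loss on discontinued operations increased; the loss reflects pension costs.",
  "A tax benefit partially offset the loss."
)

# Same cleaning steps as described earlier in this document
corpus <- VCorpus(VectorSource(ge_text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, c(stopwords("english"), "company"))

# Count how often each term occurs across all documents
tdm  <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

# Compare the two words discussed in the text
print(freq[c("loss", "benefit")])

# Assumption: if the wordcloud package is installed, draw the cloud from
# these counts; more frequent words are drawn larger
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(words = names(freq), freq = freq,
                       min.freq = 1, random.order = FALSE)
}
```

Inspecting `freq` alongside the plot makes it easy to confirm that a visually dominant word really is more frequent, and not just longer.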