This is a brief cheat sheet for the natural language processing pipeline. It is by no means elaborate; it just lists the concepts I learned from the Coursera Capstone. The content is from the Capstone project of the Data Science Coursera specialization, so the credit goes to the awesome instructors and the relevant sources.
A text mining framework must offer functionality for managing text documents, abstract the process of document manipulation, and ease the use of heterogeneous text formats. Thus there is a need for a conceptual entity, similar to a database, that holds and manages text documents in a generic way: we call this entity a text document collection or corpus.
The framework must support metadata handling in a convenient way, both on the document level (e.g., short summaries or descriptions of selected documents) and on the collection level (e.g., collection-wide classification tags).
The framework must provide tools and algorithms to efficiently work with the documents. That means the framework has to provide functionality for common tasks like whitespace removal, stemming, or stopword deletion. We denote such functions operating on text document collections as transformations.
Text mining typically involves doing computations on texts to gain interesting information. The most common approach is to create a so-called term-document matrix holding frequencies of distinct terms for each document. Another approach is to compute directly on character sequences, as is done by string kernel methods. Thus the framework must provide export mechanisms for term-document matrices and interfaces to access the document corpora as plain character sequences.
Text document collections (Corpus): A collection of text documents that can be interpreted as a database for texts. Its elements are TextDocuments holding the actual text corpora and local metadata. The text document collection has two slots for storing global metadata and one slot for database support.
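A minimal sketch of building such a corpus with the tm package, assuming the texts are already available as an in-memory character vector (other sources such as DirSource for directories work analogously):

```r
library(tm)

# Toy texts standing in for real input data
docs <- c("Text mining is fun.", "R makes text mining easy.")

# Build a volatile (in-memory) corpus from a vector source
corpus <- VCorpus(VectorSource(docs))

inspect(corpus)   # summary of the collection
corpus[[1]]       # access the first text document
```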
We can distinguish two types of metadata, namely Document Metadata and Collection Metadata.
Document metadata (DMetaData) is for information specific to text documents but with its own entity, like classification results (it holds both the classifications for each document and, in addition, global information like the number of classification levels).
Collection metadata (CMetaData) is for global metadata on the collection level not necessarily related to single text documents, like the creation date of the collection (which is independent of the documents within the collection).
Text documents: The next core class is a text document (TextDocument), the basic unit managed by a text document collection. It is an abstract class, i.e., we must derive specific document classes to obtain document types we actually use in daily text mining. Its basic slots hold the document's own metadata, such as the author, date/time stamp, description, ID, origin and heading, plus a LocalMetaData slot for further tag-value pairs.
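A short sketch of inspecting a document and its metadata through tm's meta() accessor (the exact set of metadata tags can vary slightly between tm versions):

```r
library(tm)

corpus <- VCorpus(VectorSource(c("First document.", "Second document.")))

doc <- corpus[[1]]
content(doc)                        # the raw text of the document
meta(doc)                           # its document-level metadata (author, id, ...)

meta(doc, "author") <- "Jane Doe"   # set a metadata tag on this document copy
meta(corpus, type = "corpus")       # collection-level (CMetaData-style) metadata
```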
Text repositories: Can be used to keep track of text document collections. The class TextRepository is conceptualized for storing representations of the same text document collection. This allows one to backtrack transformations on text documents and to access the original input data if desired or necessary. The dynamic slot RepoMetaData can help save the history of a text document collection, e.g., all transformations with a time stamp, in the form of tag-value pair metadata.
Term-document matrices: Probably the most common way of representing texts for further computation. A term-document matrix can be exported from a Corpus and is used as a bag-of-words representation, which means that the order of tokens is irrelevant. This approach results in a matrix with document IDs as rows and terms as columns; the matrix elements are term frequencies. Instead of using the term frequency (weightTf) directly, one can use different weightings. The slot Weighting of a TermDocMatrix provides this facility by calling a weighting function on the matrix elements. Available weighting schemes include the binary frequency (weightBin) method, which eliminates multiple entries, or the inverse document frequency (weightTfIdf) weighting, which gives more weight to discriminative terms than to irrelevant ones.
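A sketch of exporting such a matrix with different weightings; note that current tm versions use the classes DocumentTermMatrix (documents as rows) and TermDocumentMatrix (terms as rows) rather than the TermDocMatrix class described above:

```r
library(tm)

corpus <- VCorpus(VectorSource(c("the cat sat on the mat",
                                 "the dog chased the cat")))

dtm_tf    <- DocumentTermMatrix(corpus)                                          # raw term frequencies (weightTf)
dtm_bin   <- DocumentTermMatrix(corpus, control = list(weighting = weightBin))   # binary frequencies
dtm_tfidf <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf)) # tf-idf weighting

inspect(dtm_tfidf)
as.matrix(dtm_tf)   # documents as rows, terms as columns
```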
The first step is to import these texts into one’s favorite computing environment.
The second step is tidying up the texts, i.e., preprocessing them to obtain a convenient representation for later analysis. This step might involve text reformatting (e.g., whitespace removal), stopword removal, or stemming procedures.
Third, the analyst must be able to transform the preprocessed texts into structured formats to be actually computed with. For classical text mining tasks, this normally implies the creation of a so-called term-document matrix, probably the most common format to represent texts for computation.
Now the analyst can work and compute on texts with standard techniques from statistics and data mining, like clustering or classification methods.
Transformations operate on each text document in a text document collection by applying a function to it. Thus we obtain another representation of the whole text document collection. Filter operations, in contrast, allow us to identify subsets of the text document collection. Such a subset is defined by a function applied to each text document that returns a Boolean answer; hence, formally, the filter function is just a predicate function. This way we can easily identify documents with common characteristics.
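A sketch of a filter as a predicate function, using tm_filter from the tm package (the predicate shown, matching a fixed substring, is just an illustration):

```r
library(tm)

corpus <- VCorpus(VectorSource(c("crude oil prices rise",
                                 "grain harvest report",
                                 "oil and gas exploration")))

# Keep only documents whose text mentions "oil"; the function must
# return TRUE or FALSE for each document
oil_docs <- tm_filter(corpus, FUN = function(doc)
  any(grepl("oil", content(doc), fixed = TRUE)))

length(oil_docs)   # number of matching documents
```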
Transformations shipped with the tm package in R include stripWhitespace, removeNumbers, removePunctuation, removeWords (e.g., for stopword deletion) and stemDocument; arbitrary functions such as tolower can be wrapped with content_transformer. All of them are applied to a corpus via tm_map, as sketched below.
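A sketch of applying these transformations with tm_map (stemDocument additionally assumes the SnowballC package is installed):

```r
library(tm)

corpus <- VCorpus(VectorSource(c("The  QUICK brown fox, 123 times!",
                                 "Running runners run quickly.")))

getTransformations()   # lists the transformations shipped with tm

corpus <- tm_map(corpus, content_transformer(tolower))       # lower-case via an arbitrary function
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # stopword deletion
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument)                       # stemming (needs SnowballC)

content(corpus[[1]])
```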
Count-based evaluation and associations of terms: One of the simplest analysis methods in text mining is count-based evaluation. This means that those terms with the highest occurrence frequencies in a text are rated important. Another approach available in common text mining tools is finding associations for a given term, which is a further form of count-based evaluation.
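A sketch of both methods using tm's helpers and the crude example corpus that ships with the package (the frequency and correlation cut-offs are arbitrary choices):

```r
library(tm)
data("crude")   # 20 Reuters articles about crude oil, shipped with tm

tdm <- TermDocumentMatrix(crude,
                          control = list(removePunctuation = TRUE,
                                         stopwords = TRUE))

findFreqTerms(tdm, lowfreq = 10)         # terms occurring at least 10 times overall
findAssocs(tdm, "oil", corlimit = 0.7)   # terms whose occurrence correlates with "oil"
```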
Simple text clustering: Texts can be clustered with hierarchical clustering or k-means clustering. Common similarity measures in text mining are metric distances, the cosine measure, Pearson correlation, and the extended Jaccard similarity.
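A sketch of both clustering approaches on a tf-idf weighted document-term matrix; plain Euclidean distance is used here for brevity, while cosine or extended Jaccard measures would need an extra package such as proxy:

```r
library(tm)
data("crude")

dtm <- DocumentTermMatrix(crude, control = list(weighting = weightTfIdf,
                                                stopwords = TRUE))
m <- as.matrix(dtm)

# Hierarchical clustering on document distances
hc <- hclust(dist(m), method = "ward.D2")
plot(hc)

# k-means clustering with an (arbitrarily chosen) k of 3
km <- kmeans(m, centers = 3)
km$cluster   # cluster assignment per document
```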
Simple text classification: In contrast to clustering, where groups are unknown at the beginning, classification tries to put specific documents into groups known in advance. Common methods are k-nearest neighbour classification and support vector machine classification.
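A sketch of both classifiers on a document-term matrix; the class labels, the train/test split, and the use of the class and e1071 packages are assumptions for illustration only:

```r
library(tm)
library(class)   # knn()
library(e1071)   # svm()
data("crude")

dtm <- as.matrix(DocumentTermMatrix(crude, control = list(stopwords = TRUE)))

labels <- factor(rep(c("A", "B"), length.out = nrow(dtm)))   # hypothetical labels
train  <- 1:15
test   <- 16:20

# k-nearest neighbour classification
pred_knn <- knn(dtm[train, ], dtm[test, ], cl = labels[train], k = 3)

# Support vector machine classification
model_svm <- svm(x = dtm[train, ], y = labels[train], kernel = "linear", scale = FALSE)
pred_svm  <- predict(model_svm, dtm[test, ])
```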
Text clustering with string kernels: String kernels are methods that deal with the text directly, and no longer with an intermediate representation like a term-document matrix.
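A sketch using the kernlab package (an assumption here; tm itself does not provide string kernels). A spectrum string kernel is evaluated on the raw texts and the kernel-induced distances are clustered hierarchically; kernlab also offers kernel k-means (kkmeans) and spectral clustering (specc) for the same purpose:

```r
library(tm)
library(kernlab)
data("crude")

# Flatten each document to a single character string
texts <- vapply(seq_along(crude),
                function(i) paste(content(crude[[i]]), collapse = " "),
                character(1))

sk <- stringdot(type = "spectrum", length = 4)   # normalized 4-gram spectrum kernel
K  <- kernelMatrix(sk, as.list(texts))           # pairwise string-kernel similarities

# Convert normalized similarities (diagonal = 1) into distances and cluster
d <- as.dist(1 - as.matrix(K))
plot(hclust(d, method = "ward.D2"))
```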
Zipf’s law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.
The rank-frequency distribution is an inverse relation.
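A quick empirical check of this relation on a term-document matrix; the log-log plot below uses only base R, and recent tm versions also ship a Zipf_plot() convenience function for the same purpose:

```r
library(tm)
data("crude")

tdm  <- TermDocumentMatrix(crude)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)   # total frequency per term

# Under Zipf's law, log(frequency) falls roughly linearly in log(rank)
plot(log(seq_along(freq)), log(freq),
     xlab = "log(rank)", ylab = "log(frequency)",
     main = "Rank-frequency distribution")
```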
For example, in the Brown Corpus of American English text, the word “the” is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million). True to Zipf’s Law, the second-place word “of” accounts for slightly over 3.5% of words (36,411 occurrences), followed by “and” (28,852). Only 135 vocabulary items are needed to account for half the Brown Corpus.[4]
The same relationship occurs in many other rankings unrelated to language, such as the population ranks of cities in various countries, corporation sizes, income rankings, ranks of number of people watching the same TV channel, and so on. The appearance of the distribution in rankings of cities by population was first noticed by Felix Auerbach in 1913.