Ch 5: Basic Text Processing

Learning Goals

Understand some of the basic text processing steps such as tokenization, stop word removal, stemming, and lemmatization

Basic Text (Pre-)Processing

Automated text analysis always requires some form of text processing. Consider the following example of a tweet:

Today’s the day, ladies and gents. Mr. K will land in U.S. :)

If one wants to use information from this piece of text for any form of text mining, it is important to determine what the tokens in the text are:

today, ’s, the, day, ladies, and, gents, Mr., K, will, land, in, U.S., :)

This implies a process that understands that periods in abbreviations (e.g., Mr.) and acronyms (e.g., U.S.) need to be preserved as such, but that there is also punctuation that needs to be separated from the nearby tokens (the comma after day or the period after gents).

Further, a text preprocessor often normalizes the text (e.g., it may expand ’s into is or the informal gents into gentlemen), it may try to identify the root or stem of the words (e.g., lady for ladies, or be for ’s), and it may even attempt to identify and possibly label special symbols such as emoticons (e.g., :)).
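
As a concrete illustration, here is a minimal sketch of how such a tweet could be tokenized with NLTK's TweetTokenizer (assuming the nltk package is installed). It is one of several tokenizers that keep emoticons and contractions intact; as discussed below, abbreviations such as Mr. and U.S. may still require a dedicated abbreviation list:

    from nltk.tokenize import TweetTokenizer

    tweet = "Today's the day, ladies and gents. Mr. K will land in U.S. :)"

    # TweetTokenizer keeps emoticons (:)) and contractions (Today's) as
    # single tokens; how abbreviations like Mr. and U.S. are split still
    # depends on the tokenizer's rules.
    tokenizer = TweetTokenizer()
    print(tokenizer.tokenize(tweet))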

Text (pre-)processing can consist of basic steps such as:

  1. Removing the HTML (HyperText Markup Language) tags from documents collected from the web

  2. Separating the punctuation from the words

  3. Removing function words (very frequent words) as stop words; see https://en.wikipedia.org/wiki/Function_word

  4. Applying stemming or lemmatization (reducing words to their root form)

These text (pre-)processing steps result in a set of tokens that can be used to collect statistics or as input for more advanced applications such as sentiment analysis or text classification.

Note that the choice of text (pre-)processing steps is often application dependent: e.g., for analyzing the language of deception, stop words are useful and should be preserved, but to analyze the main theme of texts, stop words can be removed, and we also benefit from stemming all the input words. For identifying all the organizations that appear in a corpus, more advanced annotations are useful, such as those produced by a named entity recognition tool.
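
As a rough illustration of how the four basic steps above fit together, the sketch below chains tag removal, tokenization, stop word removal, and stemming using NLTK. The regular expression for HTML removal is deliberately naive; a real pipeline would use a proper HTML parser:

    import re
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize

    # requires NLTK data, e.g., nltk.download('punkt') and nltk.download('stopwords')
    def preprocess(html_doc):
        # 1. Remove HTML tags (naive regex, for illustration only)
        text = re.sub(r"<[^>]+>", " ", html_doc)
        # 2. Tokenize, separating punctuation from the words
        tokens = word_tokenize(text.lower())
        # 3. Drop stop words and bare punctuation tokens
        stops = set(stopwords.words("english"))
        tokens = [t for t in tokens if t.isalpha() and t not in stops]
        # 4. Stem the remaining content words
        stemmer = PorterStemmer()
        return [stemmer.stem(t) for t in tokens]

    print(preprocess("<p>The ladies are studying computational methods.</p>"))
    # e.g., ['ladi', 'studi', 'comput', 'method']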

Tokenization

Tokenization is the process of identifying the words in the input sequence of characters, mainly by separating the punctuation marks, but also by identifying contractions, abbreviations, and so forth so that their intended meaning is maintained.

This tokenization process also includes text normalization steps, such as lowercasing and removing HTML tags.

The process of tokenization assumes that white spaces and punctuation are used as explicit word boundaries. But this is not the case for all languages; in Korean, for example, spaces do not reliably mark individual word boundaries.

“Mr. Smith doesn’t like apples.” —Tokenization—> “Mr. Smith does not like apples”
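
In practice, a tokenizer following the Penn Treebank conventions (as NLTK's word_tokenize does) splits this sentence as sketched below; note that it keeps the period of Mr. attached and splits doesn't into does and n't rather than expanding it to not:

    from nltk.tokenize import word_tokenize

    # requires the punkt models, e.g., nltk.download('punkt')
    print(word_tokenize("Mr. Smith doesn't like apples."))
    # ['Mr.', 'Smith', 'does', "n't", 'like', 'apples', '.']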

Special Attention

  1. End-of-sentence periods vs. markers of abbreviations (e.g., Mr., Dr., U.S.)

  2. Contractions and abbreviations are language dependent: we need to compile a list of such words to make sure that the tokenization of the period is handled correctly. The same applies to apostrophes and hyphenation.

  3. For an apostrophe, we often want to identify the contractions and separate them such that they form meaningful individual words. For instance, the possessive book’s should form two words: book and ’s. The contraction aren’t should be separated into are and not, and he’s into he and ’s (see the code sketch after this list).

  4. For hyphenations, we often leave them in place to indicate a collocation as in, e.g., state-of-the-art, although sometimes it may be useful to separate the words to allow access to the individual words, e.g., separating Hewlett-Packard into Hewlett - Packard
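
The same Treebank-style tokenizer illustrates points 3 and 4: possessives and contractions are split at the apostrophe, while hyphenated collocations are left in place (splitting Hewlett-Packard would require an extra rule):

    from nltk.tokenize import word_tokenize

    print(word_tokenize("The book's cover"))   # ['The', 'book', "'s", 'cover']
    print(word_tokenize("They aren't here"))   # ['They', 'are', "n't", 'here']
    print(word_tokenize("a state-of-the-art system"))
    # ['a', 'state-of-the-art', 'system']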

Stop word removal

Stop words, aka function words, consist of high-frequency words including pronouns (e.g., I, we, us), determiners (e.g., the, a), prepositions (e.g., in, on), and others. For some tasks, stop words provide meaningful information: e.g., they give significant insights into people’s personalities and behaviors (Pennebaker & King, 1999). But there are also tasks for which it is useful to remove them and focus the attention on content words such as nouns and verbs. In this case, we usually use a precompiled list of stop words.

But such a precompiled list of stop words may be unavailable for a language. We can then gather word statistics on a very large corpus of texts written in that language and take the top N most frequent words as candidate stop words, since stop words are generally high-frequency words.
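
Both strategies are easy to sketch with NLTK: use its precompiled list when one exists for the language, and otherwise fall back on corpus frequencies. The corpus variable below is a hypothetical stand-in for whatever large collection of texts is available:

    from collections import Counter
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    # Strategy 1: a precompiled list (requires nltk.download('stopwords'))
    stops = set(stopwords.words("english"))

    # Strategy 2: take the top N most frequent words in a large corpus
    # as candidate stop words; 'corpus' is a hypothetical list of documents.
    corpus = ["...", "..."]
    counts = Counter(t.lower() for doc in corpus for t in word_tokenize(doc))
    candidate_stops = [w for w, _ in counts.most_common(100)]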

Stemming and Lemmatization

Many words in natural language are related, yet they have different surface forms due to grammatical reasons, such as construction and construct or study and studies. These relations can be captured by identifying the common stem of multiple words, a process called stemming. Stemming applies a set of rules to an input word to remove suffixes and prefixes and obtain its stem, which is shared with other related words. For instance, computer, computational, and computation will all be reduced to the same stem: comput. Simply put, stemming is a processing step that uses a set of rules to strip such suffixes and prefixes.

But stemming often produces stems that are not valid words, since suffixes or prefixes were removed. For example, stemming the words study and studying transforms both into studi.
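
A quick sketch with NLTK's implementation of the Porter stemmer shows both behaviors: related words collapse to a common stem, but the stem itself need not be a valid word:

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["computer", "computational", "computation", "study", "studying"]:
        print(word, "->", stemmer.stem(word))
    # computer/computational/computation -> comput; study/studying -> studi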

The alternative to stemming is lemmatization, which reduces the inflectional forms of a word to its root form. For example, lemmatization transforms studies to study and am, are, or is to be. That is, lemmatization is the process of identifying the base form (or root form) of a word as found in a dictionary. So, unlike stemming, the output obtained from lemmatization is a valid word form; thus, its output is readable by humans, although this comes at the cost of a more computationally intensive process.
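
For comparison, here is the same idea with NLTK's WordNet-based lemmatizer. Note that it needs the WordNet data (e.g., nltk.download('wordnet')) and that it benefits from a part-of-speech hint, which in a full pipeline would come from a POS tagger:

    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    print(lemmatizer.lemmatize("studies", pos="v"))  # study
    for verb in ["am", "are", "is"]:
        print(lemmatizer.lemmatize(verb, pos="v"))   # be, be, be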