Ch. 4: Lexical Resources

Learning Objectives

  1. Learn about the representation and content of two lexical resources, LIWC and Bing
  2. Learn about how to tokenize texts into words using the stringr package
  3. Learn about what regex is and how it is used for tokenization

What is a lexical resource?

A lexical resource is a collection of lexical items, such as words and phrases, together with some additional linguistic information. A typical example of a lexical resource is a dictionary, which lists words along with definitions and usage examples. An English-Korean dictionary, for instance, provides a list of English words along with their meanings, their corresponding Korean words, and usage examples. Another example is a thesaurus, which groups related words together. Yet another example is a collection of words or phrases mapped to a set of semantic classes. LIWC and Bing belong to this last type of lexical resource, as they group words or phrases into certain categories.

Linguistic Inquiry and Word Count (LIWC)

LIWC was developed by Pennebaker and his colleagues for psycholinguistic analysis. This lexicon has been used in a large number of studies in social, psychological, and linguistic research, addressing tasks such as the analysis of psychological traits, deception, social analysis of conversations, prediction of depression, identification of sarcasm, and so on. The underlying assumption of these studies is that such social and psychological characteristics are manifested in patterns of word usage.

The latest (2015) version of the LIWC lexicon is composed of almost 6,400 words, word stems, and select emoticons. These words and stems are categorized into 90 variables tapping linguistic, psychological, and social dimensions, and each entry is assigned to one or more word categories. For example, the word ‘cried’ is part of five word categories: sadness, negative emotion, overall affect, verbs, and past focus. Hence, if the word ‘cried’ is found in the target text we want to analyze, each of these five category scores will be incremented. These categories are arranged hierarchically: all sadness words, by definition, also belong to the broader “negative emotion” category as well as the “overall affect” category.
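To make the idea of incrementing category scores concrete, here is a minimal sketch in R using a toy, made-up mini-lexicon. The real LIWC dictionary is proprietary, so the object toy_liwc and its category labels below are purely illustrative, not the actual LIWC content.

library(stringr)
# Toy word-to-category mapping (hypothetical; NOT the actual LIWC dictionary)
toy_liwc <- list(
  cried = c("sadness", "negative emotion", "affect", "verbs", "past focus"),
  happy = c("positive emotion", "affect")
)
text  <- "She cried and then cried again but later felt happy"
words <- str_split(str_to_lower(text), " ")[[1]]
# Each occurrence of a lexicon word increments every category it belongs to
matched <- words[words %in% names(toy_liwc)]
table(unlist(toy_liwc[matched]))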

The LIWC lexicon has been validated by showing significant correlations between human ratings of a large number of written texts and the ratings obtained through LIWC-based analyses of the same texts. This means that automatically analyzing word-frequency patterns with LIWC can substitute for manual analysis of texts by humans.

Sentiment Lexicons

Sentiment lexicons provide lists of words annotated with the type of sentiment they convey. That is, words in sentiment lexicons have an association with sentiment. For example, honest and competent are associated with positive sentiment, whereas dishonest and dull are associated with negative sentiment.

Furthermore, the degree of positivity (or negativity), also referred to as sentiment intensity, can vary. For example, most people will agree that succeed is more positive (or less negative) than improve, and failure is more negative (or less positive) than decline.

Sentiment associations are commonly captured in sentiment lexicons, which are lists of word-sentiment pairs (optionally with a score indicating the degree of association). Using a sentiment lexicon, we can measure the sentiment content of the words in a text.

The Bing Lexicon in the tidytext package

There exist a number of sentiment lexicons (such as “afinn” and “nrc”) that provide lists of positive and negative words for evaluating the opinion or emotion in text. The Bing lexicon, developed by Bing Liu and collaborators, is one of the most popular sentiment lexicons and is available through the tidytext package.

This lexicon is based on unigrams, i.e., single words. It contains many English words, and the bing lexicon categorizes each word in a binary fashion as either positive or negative.

All of this information is tabulated in the sentiments dataset of the tidytext package, and the function get_sentiments("bing") retrieves the list of words and their sentiment in the following columns:

  • word, an English word (unigram)
  • sentiment, either “positive” or “negative”.
#install.packages("tidytext")
library(tidytext)
get_sentiments("bing")
## # A tibble: 6,786 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # ... with 6,776 more rows
  • Note that sentiment lexicons are stored in a tidy data frame with one word per row. But not every English word is in the lexicon, because many English words are fairly neutral. Also, words with non-ASCII characters were removed from the lexicons. Finally, the lexicons do not take into account qualifiers before a word, such as in “no good” or “not true”.
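As a quick illustration of how the bing lexicon can be used to measure sentiment, the sketch below tokenizes a short, made-up example sentence with unnest_tokens() and joins it with the lexicon. The object name example_text and the sentence itself are assumptions for illustration only.

library(tidytext)
library(dplyr)
# A made-up example sentence (hypothetical data for illustration)
example_text <- tibble(line = 1, text = "The honest and competent response avoided a dull, dishonest failure")
example_text %>%
  unnest_tokens(word, text) %>%                        # one word per row
  inner_join(get_sentiments("bing"), by = "word") %>%  # keep only words found in the lexicon
  count(sentiment)                                     # count positive vs. negative words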

Tokenization using R

Cleaning texts for tokenization

  1. White space: simply solved!
  2. Punctuation marks: we cannot list in advance which punctuation marks appear, so we need a general way to detect them
  3. Numbers: we cannot list in advance which numbers appear, so we need a general way to detect them
  4. Stop words: What are stop words? https://en.wikipedia.org/wiki/Stop_words
  5. Non-English text (ASCII): How do we detect non-English characters? https://en.wikipedia.org/wiki/UTF-8
  6. URLs: we cannot list in advance which URLs appear, so we need a general way to detect them

To solve these problems, we need to take things to the next level.
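As a preview of where we are headed, here is a minimal cleaning sketch using stringr. The regex patterns it relies on ([[:punct:]], \d+, \S) are explained in the remainder of this chapter, the messy string is made up for illustration, and the URL pattern is deliberately simplistic.

library(stringr)
# A made-up messy string for illustration
messy <- "   In 2019, masks were  recommended!!  See https://example.org/guide for details.  "
clean <- str_remove_all(messy, "http\\S+")     # drop URLs (simplistic pattern)
clean <- str_remove_all(clean, "[[:punct:]]")  # drop punctuation marks
clean <- str_remove_all(clean, "\\d+")         # drop numbers
clean <- str_squish(clean)                     # collapse extra white space
clean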

Regular Expression

Last week, we learned some basic functions from the stringr package for handling and working with text in R. But in this course, we want to unleash the full power of string manipulation. So we are going to learn about regular expressions.

What are Regular Expressions?

The name “Regular Expression” does not say much. However, regular expressions are all about text. Think about how much text is all around us in our modern digital world: emails, text messages, news articles, blogs, comments, tweets—all these things are text. Regular expressions are a tool that allows us to work with these text data by describing text patterns.

A regular expression is a special string for describing a certain text pattern. In other words, a regular expression is a set of symbols that describes a set of strings. Because the term “regular expression” is rather long, most people use the word regex as a shortcut term.

It is worth noting what regular expressions are NOT. They’re NOT a programming language. They may look like some sort of programming language because they are a formal language with a defined set of rules that makes a computer do what we want it to do. However, there are no variables in regex and you can’t do computations like adding 2 + 2.

What are Regular Expressions (RegEx) used for?

We use regex to work with text. You could use regex to search a document for a word such as center, spelled either as “center” or “centre”. You could search a document and replace all occurrences of “Korea, South”, “Republic of Korea”, or “R.O.K.” with “South Korea”.

Consider the second and third problems we had when building the word cloud. Our document from Wikipedia contains a lot of punctuation marks and numbers, which we may want to remove from our word cloud. How can we detect and extract them from the document?

Actually, we already used regex to detect the term “References” and locate where it is in the word vector. To do so, I set the text pattern to “References”. Likewise, if we want to detect and extract the number “2019” from the document, we can do so with the following R input: which(str_detect(covid_text_word_main, "2019"))

However, this text pattern is very specific and not generalizable. We need to remove all punctuation marks at once, because the document contains a variety of punctuation marks, including question marks, commas, exclamation marks, and quotation marks. When the text becomes large, it is hardly possible to specify every number we want to remove. But using regex, we can describe what we are looking for in the text. In the case of the Wikipedia document, we can detect and extract any digits or any punctuation marks without having to specify each one we are looking for. Once we define a regex pattern, R will return the matching results.
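For example, the sketch below uses the character classes \d+ (any run of digits) and [[:punct:]] (any punctuation mark), which we will cover in detail later, on a made-up example string.

library(stringr)
# A made-up example string (for illustration only)
covid_snippet <- "In 2019, a new virus emerged; by 2020, masks were required!"
str_detect(covid_snippet, "\\d+")              # TRUE: the string contains at least one digit
str_extract_all(covid_snippet, "\\d+")         # extract every run of digits: "2019", "2020"
str_extract_all(covid_snippet, "[[:punct:]]")  # extract every punctuation mark: ",", ";", ",", "!"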

Before getting into regex

Regular expressions may seem difficult to understand at first. You will see strings with a bunch of letters, digits, and other punctuation symbols combined in seemingly nonsensical ways. Like programming and data analysis, learning regular expressions and becoming fluent in defining regex patterns takes time and requires a lot of practice. But the more you practice, the more fluent you will become in defining complex patterns and getting the most out of them. And regex is supported by most other programming languages, such as Python, Perl, and Java!

Regex Basics

Our purpose in working with regex is to describe certain patterns that match against text strings. That is to say, working with regex is all about pattern matching: the result of a match is either successful or not. So, as long as you specify a text pattern you want to detect, R will return the characters (or strings) that match the pattern.

As mentioned above, the simplest version of pattern matching is to search for any occurrences of some specific characters in a string. For example, we searched for the word “References” in a text document from the Wikipedia page.

But we may need to form a regex pattern with a more complex structure; for example, what if we want to match all words starting with a hashtag # or ending with “ing”? In such cases we construct a regex in much the same way we construct arithmetic expressions: by combining smaller expressions into larger ones, as previewed in the sketch below.
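As a brief preview (the anchors ^ for “starts with” and $ for “ends with” are introduced later), this sketch checks a made-up word vector for both patterns.

library(stringr)
words <- c("#covid", "masking", "mask", "#stayhome", "recommending")  # made-up example words
str_detect(words, "^#")    # TRUE for words starting with a hashtag
str_detect(words, "ing$")  # TRUE for words ending with "ing"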

Matching Literal Characters

Let’s begin with the simplest match of all: a literal character. A literal character match means that a given character, such as the letter "A", matches the letter A. This is why it is called literal: it matches itself. This type of match is the most basic type of regex operation: just matching plain text in quotes.

Here are some basic examples to build your understanding of regex.

The first regex we work with is "the". This pattern is formed by the letter t, followed by the letter h, and ending with the letter e. But this pattern matches not only the word the but also words such as they and soothe. So our regex pattern should begin and end with a blank: " the "

Consider the string object: covid_sent

To get a visual representation of the actual pattern matched in the string object, we can use the function str_view_all() from the stringr package:

library(stringr)
covid_sent <- "   The use of masks is    recommended for those who suspect they have the virus and their caregivers   Recommendations for mask     use by the general public vary,     with some authorities recommending     against their use    some recommending their use and others requiring their use   "
covid_sent
## [1] "   The use of masks is    recommended for those who suspect they have the virus and their caregivers   Recommendations for mask     use by the general public vary,     with some authorities recommending     against their use    some recommending their use and others requiring their use   "
covid_sent_trim <- str_squish(covid_sent)
str_view_all(covid_sent_trim, " the ") # string name comes first and specified pattern of regex follows

This may seem simple, but there are a couple of details worth highlighting. The first is that regex searches are case sensitive. This means that the pattern "THE" would not match the in covid_sent_trim.

str_view_all(covid_sent_trim, "THE") # regex is case sensitive, so it does not match anything

The second is that regex counts a blank as a character: blanks are considered literal characters. Let’s test the pattern " the "

str_view_all(covid_sent_trim, " the ") # It differntiates " the " from " they" by ending with a blank in regex 
save(covid_sent_trim, file="covid_sent_trim.RData")