Ch. 4: Lexical Resources

Learning Objectives

Learn about the representation and content of two lexical resources, LIWC and Bing
Learn about how to tokenize texts into words using the stringr package
Learn about what regex is and how it is used for tokenization

What is a lexical resource?

A lexical resource is a collection of lexical items, such as words and phrases, together with additional linguistic information. A typical example is a dictionary, which lists words along with definitions and usage examples. An English-Korean dictionary, for instance, lists English words along with their meanings, the corresponding Korean words, and usage examples. Another example is a thesaurus, which groups related words together. Yet another example is a collection of words or phrases mapped to a set of semantic classes. LIWC and Bing belong to this last type of lexical resource, as they group words or phrases into categories.

Linguistic Inquiry and Word Count (LIWC)

LIWC was developed for psycholinguistic analysis by Pennebaker and his colleagues. This lexicon has been used in a large number of studies in social, psychological, and linguistic research, addressing tasks such as analysis of psychological traits, deception, social analysis of conversations, prediction of depression, identification of sarcasm, and so on. The underlying assumption of these studies is that such social and psychological characteristics are manifested in patterns of word usage.

The 2015 version of the LIWC lexicon is composed of almost 6,400 words, word stems, and select emoticons. These words and stems are categorized into 90 variables tapping linguistic, psychological, and social dimensions. Each entry belongs to one or more word categories. For example, the word ‘cried’ is part of five word categories: sadness, negative emotion, overall affect, verbs, and past focus. Hence, if the word ‘cried’ is found in the target text we want to analyze, each of these five category scale scores will be incremented. The categories are also arranged hierarchically: all sadness words, by definition, belong to the broader “negative emotion” category, as well as the “overall affect words” category.
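To make the counting rule concrete, here is a minimal sketch in base R. The object toy_dict is a hypothetical mini-dictionary typed by hand (only the entry for "cried" follows the example above), so it only imitates how a LIWC-style lookup might work and is not the actual LIWC lexicon or software.

```r
# Hypothetical mini-dictionary: each word maps to one or more categories.
# Only the entry for "cried" follows the example in the text; the rest is made up.
toy_dict <- list(
  cried  = c("sadness", "negative emotion", "overall affect", "verbs", "past focus"),
  walked = c("verbs", "past focus"),
  happy  = c("positive emotion", "overall affect")
)

# A target text that has already been split into lower-case words.
tokens <- c("she", "cried", "and", "walked", "home")

# Keep the tokens found in the dictionary and collect all of their categories.
found <- tokens[tokens %in% names(toy_dict)]
matched_categories <- unlist(toy_dict[found], use.names = FALSE)

# Each occurrence of a word increments every category the word belongs to.
category_counts <- table(matched_categories)
category_counts
```

In this toy text, "verbs" and "past focus" are each counted twice, because both "cried" and "walked" belong to them.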

The LIWC lexicon has been validated by showing significant correlation between human ratings of a large number of written texts and the ratings obtained through LIWC-based analyses of the same texts. This suggests that analyzing word-frequency patterns automatically with LIWC can, to a useful degree, stand in for manual analysis of texts by humans.

Sentiment Lexicons

Sentiment lexicons provide a list of words annotated with the type of sentiment they carry. That is, words in sentiment lexicons are associated with a sentiment. For example, honest and competent are associated with positive sentiment, whereas dishonest and dull are associated with negative sentiment.

Furthermore, the degree of positivity (or negativity), also referred to as sentiment intensity, can vary. For example, most people will agree that succeed is more positive (or less negative) than improve, and failure is more negative (or less positive) than decline.

Sentiment associations are commonly captured in sentiment lexicons, which are lists of associated word-sentiment pairs (optionally with a score indicating the degree of association). Using the sentiment lexicons, we can measure the sentiment content for words in the text.

Bing Lexicon in the `tidytext` package

There exist a number of sentiment lexicons (such as “afinn” and “nrc”) that provide lists of positive and negative words that can be used for evaluating the opinion or emotion in text. The Bing lexicon, developed by Bing Liu and collaborators, is one of the most popular sentiment lexicons and is available through the tidytext package.

This lexicon is based on unigrams, i.e., single words. It contains many English words, each labeled as either positive or negative. In other words, the bing lexicon categorizes words in a binary fashion rather than assigning numeric scores.

All of this information is tabulated in the sentiments dataset of the tidytext package, which also provides the function get_sentiments("bing") to retrieve the list of words and their sentiment in the following columns:

word, an English word (unigram)
sentiment, one of either “positive” or “negative” emotions.

#install.packages("tidytext")
library(tidytext)
get_sentiments("bing")

## # A tibble: 6,786 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # … with 6,776 more rows

Note that sentiment lexicons are stored as tidy data frames with one word per row. However, not every English word is in the lexicon, because many English words are fairly neutral. Also, words with non-ASCII characters were removed from the lexicons. Finally, the lexicons do not take into account qualifiers before a word, as in “no good” or “not true”.
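The last limitation is easy to see with a small sketch. The data frame toy_lexicon below is typed by hand in the same two-column layout as get_sentiments("bing"); its rows and the example tokens are assumptions for illustration, not the real lexicon.

```r
# Toy lexicon in the same two-column layout (word, sentiment).
# The rows are typed by hand for illustration only.
toy_lexicon <- data.frame(
  word = c("good", "honest", "dull", "dishonest"),
  sentiment = c("positive", "positive", "negative", "negative"),
  stringsAsFactors = FALSE
)

# A tokenized sentence: "the talk was not good and rather dull"
tokens <- c("the", "talk", "was", "not", "good", "and", "rather", "dull")

# Keep only the tokens that appear in the lexicon, together with their labels.
hits <- toy_lexicon[match(tokens, toy_lexicon$word, nomatch = 0), ]
hits

# Net score: number of positive words minus number of negative words.
score <- sum(hits$sentiment == "positive") - sum(hits$sentiment == "negative")
score
```

The word "good" is counted as positive even though it follows "not", so the net score comes out as 0 rather than clearly negative.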

Tokenization using R

Cleaning texts for tokenization

White space: Simply solved!
Punctuation marks: Cannot specify what punctuation marks appear and how they are detected
Numbers: Cannot specify what numbers appear and how they are detected
Stop words: What are stop words? https://en.wikipedia.org/wiki/Stop_words
Non-English text (ASCII): How to detect non-English characters https://en.wikipedia.org/wiki/UTF-8
URLs: Cannot specify what URLs appear and how they are detected

To solve these problems, we need to take things to the next level.

Regular Expression

Last week, we learned some basic functions from the stringr package for handling and working with text in R. But in this course, we want to unleash the power of string manipulation. So we are going to learn about regular expressions.

What are Regular Expressions?

The name “Regular Expression” does not say much. However, regular expressions are all about text. Think about how much text is all around us in our modern digital world: emails, text messages, news articles, blogs, comments, tweets—all these things are text. Regular expressions are a tool that allows us to work with these text data by describing text patterns.

A regular expression is a special string for describing a certain text pattern. In other words, a regular expression is a set of symbols that describes a set of strings. Because the term “regular expression” is rather long, most people use the word regex as a shortcut term.

It is worth noting what regular expressions are NOT. They’re NOT a programming language. They may look like some sort of programming language because they are a formal language with a defined set of rules that makes a computer do what we want it to do. However, there are no variables in regex and you can’t do computations like adding 2 + 2.

What are Regular Expressions (RegEx) used for?

We use regex to work with text. You could use regex to search a document for the word center, whether it is spelled “center” or “centre”. You could search a document and replace all occurrences of “Korea, South”, “Republic of Korea”, or “R.O.K.” with “South Korea”.
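As a preview, the two tasks just mentioned might look like the sketch below. The example strings are made up, and the syntax used here (parentheses, the bar |, and the escaped dot) is explained step by step later in this chapter.

```r
library(stringr)

# Made-up sentences for illustration.
docs <- c("The city center is busy.", "The city centre is busy.", "Enter here.")

# "cent(er|re)" matches "cent" followed by either "er" or "re".
has_center <- str_detect(docs, "cent(er|re)")
has_center

places <- c("Korea, South", "Republic of Korea", "R.O.K.", "Japan")

# The bar lists the variants; the dots in "R.O.K." are escaped
# so that they match literal dots only.
cleaned <- str_replace_all(
  places,
  "Korea, South|Republic of Korea|R\\.O\\.K\\.",
  "South Korea"
)
cleaned
```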

Consider the second and third problems we had in wordcloud. Our document from Wikipedia contains a lot of punctuation marks and numbers, which we may want to remove from our wordcloud. How can we detect and extract them from the document?

Actually, we already used regex to detect a term, “References” and locate where the term is in the word vector. To do so, I set the pattern of text as “References”. Likewise, if we want to detect and extract a number “2019” from the document, we can do so by the following R input: which(str_detect(covid_text_word_main, "2019"))

However, such a text pattern is very specific and does not generalize. We need to remove all punctuation marks at once, because the document contains a variety of them, including question marks, commas, exclamation marks, and quotation marks. Likewise, when the text becomes large, it is hardly possible to list all the numbers we want to remove. With regex, we can instead describe the kind of thing we are looking for. In the case of the Wikipedia document, we can detect and extract any number or any punctuation mark without having to spell out each one. Once we define a regex pattern, R will return the matching results.

Before getting into regex

Regular expressions may seem difficult to understand at first. You will see strings with a bunch of letters, digits, and other punctuation symbols combined in seemingly nonsensical ways. Like programming and data analysis, learning regular expressions and becoming fluent in defining regex patterns takes time and requires a lot of practice. But the more you practice, the more fluent you will become in defining complex patterns and getting the most out of them. Regex is also supported by most other programming languages, such as Python, Perl, and Java.

Regex Basics

Our purpose in working with regex is to describe patterns that match against text strings. That is to say, working with regex is all about pattern matching. The result of a match is either successful or not. So, as long as you specify the text pattern you want to detect, R will return the characters (or strings) that match the pattern.

As mentioned above, the simplest version of pattern matching is to search for any occurrences of some specific characters in a string. For example, we searched for the word “References” in a text document from the Wikipedia page.

But we may need to form a regex pattern with a more complex structure; for example, what if we want to match against all words starting with a hashtag # or ending with “ing”? In such cases we construct a regex much as we build an arithmetic expression: by combining simple pieces with operators.

Matching Literal Characters

Let’s begin with the simplest match of all: a literal character. In a literal match, a given character such as the letter "A" matches the letter A. It is called literal because the character matches itself. This is the most basic type of regex operation: matching plain text written inside quotes.

Here are some basic examples to help you understand regex.

The first regex we work with is "the". This pattern is formed by a letter t, followed by a letter h, followed by a letter e. But this pattern matches not only the word the but also part of the words they and soothe. To match the word on its own, our regex pattern should begin and end with a blank: " the "

Consider the string object: covid_sent

To have a visual representation of the actual pattern that is matched to the string object, we can use the function str_view_all() in the package stringr:

library(stringr)
covid_sent <- "   The use of masks is    recommended for those who suspect they have the virus and their caregivers   Recommendations for mask     use by the general public vary,     with some authorities recommending     against their use    some recommending their use and others requiring their use   "
covid_sent

## [1] "   The use of masks is    recommended for those who suspect they have the virus and their caregivers   Recommendations for mask     use by the general public vary,     with some authorities recommending     against their use    some recommending their use and others requiring their use   "

covid_sent_trim <- str_squish(covid_sent)
str_view_all(covid_sent_trim, " the ") # string name comes first and specified pattern of regex follows

This may seem simple, but there are a couple of details to highlight. The first is that regex searches are case sensitive. This means that the pattern "THE" would not match "the" in covid_sent_trim.

str_view_all(covid_sent_trim, "THE") # regex is case sensitive, so it does not match anything

The second thing is that regex treats a blank as a character: blanks are literal characters. Let’s test the pattern " the "

str_view_all(covid_sent_trim, " the ") # It differentiates " the " from " they" because the pattern ends with a blank
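Since str_view_all() shows its result in a viewer, it can help to check the same points with numbers. The sketch below uses str_count() on a shortened copy of the sentence (the object sent is introduced here only for illustration) and adds regex(ignore_case = TRUE), a stringr modifier that switches case sensitivity off.

```r
library(stringr)

# A shortened copy of the example sentence, used only for this illustration.
sent <- "The use of masks is recommended for those who suspect they have the virus"

n_the        <- str_count(sent, "the")    # counts "the" inside "they" and the word "the"
n_the_blank  <- str_count(sent, " the ")  # only the stand-alone word
n_upper      <- str_count(sent, "THE")    # case sensitive, so nothing matches
n_ignorecase <- str_count(sent, regex("the", ignore_case = TRUE))  # also counts "The"

c(n_the, n_the_blank, n_upper, n_ignorecase)
# 2 1 0 3
```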

Metacharacters

Now, we are going to learn about metacharacters. The most basic type of regex is the literal character that matches itself. But not all characters match themselves. Any character that does not match itself is a metacharacter. Metacharacters have a special meaning, and they allow us to transform literal characters in very powerful ways.

Here’s the list of 15 metacharacters in regex.

the dot .
the backslash \
the bar |
opening parenthesis (
closing parenthesis )
opening bracket [
closing bracket ]
opening brace {
closing brace }
the dollar sign $
the hyphen -
the caret ^
the star *
the plus sign +
the question mark ?

Throughout this course, we are going to work with these metacharacters. Actually, what we need to know about regex is how these metacharacters work. Fortunately, there are only a few metacharacters to learn. Unfortunately, some metacharacters have more than one meaning. The meaning of a metacharacter depends on the context in which we use it, how we use it, and where we use it. So learning those meanings may take time and requires hours of practice.

The Wild Metacharacter, the dot

The first metacharacter we learn about is the dot or period ".", better known as the wild metacharacter. This metacharacter is used to match ANY character except for a new line.

For example, consider a pattern "t.e". This pattern will match not only the, but also tae, tee, tie, toe, and so on. But it will not match tube, because the dot matches exactly one character. For the same reason, in thee and tree it matches only the first three characters (the and tre), not the whole word.

covid_sent_trim

## [1] "The use of masks is recommended for those who suspect they have the virus and their caregivers Recommendations for mask use by the general public vary, with some authorities recommending against their use some recommending their use and others requiring their use"

str_view_all(covid_sent_trim, "t.e")
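To see exactly which characters the dot picks up, a small sketch with a hand-made word vector (an assumption for illustration) may help. Note that str_detect() reports a match whenever the pattern occurs anywhere inside a string, while str_extract() returns the matched piece itself.

```r
library(stringr)

# Hand-made word list for illustration.
words <- c("the", "tie", "toe", "thee", "tree", "tube")

# TRUE whenever "t", any one character, and "e" occur in a row somewhere.
dot_found <- str_detect(words, "t.e")
dot_found
# TRUE TRUE TRUE TRUE TRUE FALSE

# The piece that was actually matched (NA when there is no match).
dot_piece <- str_extract(words, "t.e")
dot_piece
# "the" "tie" "toe" "the" "tre" NA
```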

The wild metacharacter is one of the most popular metacharacters in regex, but it is also the source of many mistakes. Let’s say we want to form a regex to match "e.g". If you think that this pattern will match a letter e, followed by the dot . and the letter g, you will be surprised to find out that it not only matches e.g, but also eng, e g, e-g, and so on. Why? Because "." is the metacharacter that matches any single character. This shows an important fact about regex: we need to match what we want, but only what we want. We want to find the thing we are looking for, and nothing more!

Escaping metacharacters, the backslash (or Korean won sign)

How can we match the character dot instead of the metacharacter, then? For instance, say we have the following character vector:

dot_words <- c("e.g", "eng", "e g", "e-g")

If we try the pattern "e.g", it will match all of the elements in dot_words.

str_view_all(dot_words, "e.g")

To actually match the dot character, what we need to do is to escape the metacharacter. In most languages, the way to escape a metacharacter is by adding a backslash character in front of the metacharacter: "\.". When we put a backslash in front of a metacharacter, we are escaping it: the character no longer has a special meaning and matches itself.

However, R is a bit different. Instead of a single backslash, we should type two backslashes: "e\\.g". This is because the backslash "\" is also an escape character inside R strings, so "\\" is how we write the one literal backslash that then reaches the regex engine.

str_view_all(dot_words, "e\\.g")
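The difference between the two patterns can also be checked as TRUE/FALSE values. The sketch below additionally uses writeLines() to print the string the regex engine actually receives, and fixed(), a stringr helper that treats the whole pattern as plain text; both are added here for illustration.

```r
library(stringr)

dot_words <- c("e.g", "eng", "e g", "e-g")

# Unescaped dot: matches every element.
unescaped <- str_detect(dot_words, "e.g")
unescaped
# TRUE TRUE TRUE TRUE

# Escaped dot: matches only the literal "e.g".
escaped <- str_detect(dot_words, "e\\.g")
escaped
# TRUE FALSE FALSE FALSE

# What the regex engine sees: a single backslash before the dot.
writeLines("e\\.g")

# fixed() switches regex off entirely, so no escaping is needed.
as_fixed <- str_detect(dot_words, fixed("e.g"))
as_fixed
# TRUE FALSE FALSE FALSE
```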

Regex practice

So far, we have learned about metacharacters and how to escape the metacharacters. From now on, we will learn more about metacharacters and the opening and closing brackets [ ], used for defining a character set.

Character sets

A character set matches any of the various characters that are inside the set: i.e., "[abc]" will match the characters, “a”, “b”, or “c”, in the text. The square brackets [ ] indicate the character set.

Note that the order of the characters inside the character set does NOT matter; what matters is the presence of the characters inside the brackets. So, the character set "[abc]" will match any of the lower-case letters “c”, “b”, or “a” in the text. And "[cba]" will do the same thing.

Defining character sets

Consider a regex pattern that includes a character set of vowels: "f[aeiou]n", and a vector with the words “fan”, “fen”, “fin”, “fon”, “fun”

library(stringr)
fns <- c("fan","fen","fin","fon","fun")
str_view_all(fns, "f[aeiou]n")

The set “f[aeiou]n” matches all elements in fns. Now let’s use the same set with another vector fnx:

fnx <- c("fan","fin","fun","f0n","f.n","f1n","fain")
str_view_all(fnx, "f[aeiou]n")

As you can see, only the first three elements, which have a single vowel between “f” and “n”, are matched. The last element “fain” is not matched, because the character set matches only one character, either “a” or “i”, but not “ai”.
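The same comparison can be written with str_detect(), which makes the one-character rule explicit. This is only a sketch; the second pattern, with two character sets in a row, is added here to show what it takes to match “fain”.

```r
library(stringr)

fnx <- c("fan", "fin", "fun", "f0n", "f.n", "f1n", "fain")

# One set = exactly one vowel between "f" and "n".
one_vowel <- str_detect(fnx, "f[aeiou]n")
one_vowel
# TRUE TRUE TRUE FALSE FALSE FALSE FALSE

# Two sets in a row = exactly two vowels between "f" and "n".
two_vowels <- str_detect(fnx, "f[aeiou][aeiou]n")
two_vowels
# FALSE FALSE FALSE FALSE FALSE FALSE TRUE
```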

Character ranges

The above character set specifies the possible characters we want to match against. But what if we want to match any letter in the English alphabet, either upper-case or lower-case, or any digit?

Character ranges help us solve this problem: we have a convenient shortcut based on the hyphen metacharacter "-" to indicate a range of characters. A character range consists of a character set with two characters separated by a hyphen "-" sign.

So, to match any letter or number, we can define a character set formed as:

uppercase <- "[A-Z]"

lowercase <- "[a-z]"

number <- "[0-9]"

Note that the hyphen is only a metacharacter when it is inside a character set; outside the character set it is just a literal hyphen.

How, then, do we use the character range? Let’s see the following vector with triplet strings and match various occurrences of a certain type of character.

triplets <- c("bts","the","BTS","The","010","070",":-)","^^;")
str_view_all(triplets, "[a-z][a-z][a-z]") # any three consecutive lower-case letters

str_view_all(triplets, "[A-Z][A-Z][A-Z]") # any three consecutive upper-case letters

str_view_all(triplets, "[A-Z][a-z][a-z]") # any upper case letter first, followed by any two lower-case letters

str_view_all(triplets, "[0-9][0-9][0-9]") # any numbers with three consecutive digits

Note that the elements ":-)" and "^^;" are not matched by any of the character ranges that we have seen so far.

Repetition

We can control how many times a pattern matches with the repetition operators:

{n}: exactly n times
{n,}: n times or more
{n,m}: between n and m times
?: 0 or 1 time
+: 1 or more times
*: 0 or more times

str_view_all(triplets, "[a-z]{3}") # any three consecutive lower-case letters

str_view_all(triplets, "[A-Z]{2,}") # any upper-case letter repeated 2 times or more

str_view_all(triplets, "[A-Z][a-z]+") # any upper-case letter first, followed by one or more lower-case letters

str_view_all(triplets, "[0-9]+") # any number with one or more digits
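To check which elements each pattern picks out, the sketch below repeats the patterns with str_subset(), which returns the matching elements themselves. The "colou?r" line is an extra example, added here to illustrate the ? operator.

```r
library(stringr)

triplets <- c("bts", "the", "BTS", "The", "010", "070", ":-)", "^^;")

lower3  <- str_subset(triplets, "[a-z]{3}")     # "bts" "the"
upper2  <- str_subset(triplets, "[A-Z]{2,}")    # "BTS"
capital <- str_subset(triplets, "[A-Z][a-z]+")  # "The"
digits  <- str_subset(triplets, "[0-9]+")       # "010" "070"

# "?" makes the preceding character optional: 0 or 1 "u".
optional_u <- str_detect(c("color", "colour", "colr"), "colou?r")
optional_u
# TRUE TRUE FALSE
```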

Negative character sets

When working with regex, we frequently need to match characters that are NOT part of a certain set. For example, we may want to match any character that is not a letter of the alphabet. This can be done with a negative character set, which matches any one character that is not in the set. To define this type of set, we use the metacharacter caret "^".

The caret "^" is one of the metacharacters that have more than one meaning depending on where it appears in a regex pattern. If we use a caret in the first position inside a character set, i.e. "[^a-z]", it means negation: “not any one of the following lower-case letters.” So it matches any character except a lower-case letter.

So, we can match the elements ":-)" and "^^;", which contain neither letters nor numbers, by defining a negative character range "[^a-zA-Z0-9]"

str_view_all(triplets, "[^a-zA-Z0-9]{3}") # three consecutive characters that are neither letters nor digits

It is important to note that the caret means negation only when it comes first inside the character set; otherwise the set is not a negative one:

str_view_all(triplets, "[a-zA-Z0-9^]{3}") # three consecutive letters/numbers/caret

In this case, the pattern "[a-zA-Z0-9^]" means “any one letter or number or caret character,” which is completely different from the negative set "[^a-zA-Z0-9]" that negates any one letter/number.

How can we match the literal character ^ in the last element of triplets without a character set? Use double backslashes!

str_view_all(triplets, "\\^\\^;") # two literal carets followed by a semicolon

If we want to match any character except the caret, then we need to use a character set with two carets: "[^^]". The first caret works as a negative operator, and the second caret is the caret character itself:

str_view_all(triplets, "[^^][^^][^^]") # three consecutive characters that are not a caret
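Negative sets are handy for cleaning. The sketch below checks the negative range on triplets and then uses one to strip everything except digits from a made-up phone number string (an assumption for illustration).

```r
library(stringr)

triplets <- c("bts", "the", "BTS", "The", "010", "070", ":-)", "^^;")

# Elements made of three characters that are neither letters nor digits.
symbols_only <- str_subset(triplets, "[^a-zA-Z0-9]{3}")
symbols_only
# ":-)" "^^;"

# Outside a character set, a literal caret is written with double backslashes.
has_caret <- str_detect(triplets, "\\^")
has_caret
# FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE

# Remove every character that is NOT a digit.
digits_only <- str_remove_all("Tel: 010-1234-5678", "[^0-9]")
digits_only
# "01012345678"
```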

Metacharacters inside character sets

Now we know what character sets are, how to define character ranges, and how to specify negative character sets. From now on, let’s talk about what happens when including metacharacters inside character sets.

Except for the caret in the first position, most metacharacters inside a character set are already ESCAPED. This means that we do not need to escape them using double backslashes inside the character set.

Consider the vector of words fnx for example. A regex with the character set formed by "f[.aiu]n" includes the dot character. And remember that the dot character is a metacharacter, in general, which matches any type of character. However, when the dot character is inside a character set, it loses its function as a metacharacter. So the character set only matches letters “a”, “i”, “u”, and the literal dot character “.” between “f” and “n”.

fnx

## [1] "fan"  "fin"  "fun"  "f0n"  "f.n"  "f1n"  "fain"

str_view_all(fnx, "f[.aiu]n") # "f", then one of "a"/"i"/"u" or the literal dot character, then "n"

Unfortunately, not all metacharacters become literal characters when they are inside a character set. There are some exceptions: the closing bracket ] and the hyphen -, as well as the caret ^.

The closing bracket ] is used to enclose the character set. So, when we want to use a literal closing bracket inside a character set, we should escape it using double backslashes "[aiu\\]]".

As we’ve already seen, the hyphen character - is used to define a range of characters inside a character set: i.e. [a-d] and [0-5]. So, to match a literal hyphen inside a character set, we escape it in the same way: "[a\\-z]" matches “a”, “-”, or “z”.

escape <- c("f^n","f]n","f-n") # We need a regex pattern to match these strings
str_view_all(escape, "f[\\]\\-\\^]n") # Inside the set, each metacharacter is escaped by the double backslashes in front of it
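Both cases, the dot that needs no escaping and the three characters that do, can be checked with str_detect(). In this sketch the element "fan" is appended to the vector only to show a non-match.

```r
library(stringr)

fnx <- c("fan", "fin", "fun", "f0n", "f.n", "f1n", "fain")

# Inside a set the dot is a literal dot, so "f0n" and "f1n" are not matched.
dot_in_set <- str_detect(fnx, "f[.aiu]n")
dot_in_set
# TRUE TRUE TRUE FALSE TRUE FALSE FALSE

# "fan" is added here only as a counter-example.
escape_plus <- c("f^n", "f]n", "f-n", "fan")

# The closing bracket, hyphen, and caret are escaped inside the set.
escaped_in_set <- str_detect(escape_plus, "f[\\]\\-\\^]n")
escaped_in_set
# TRUE TRUE TRUE FALSE
```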

Character classes

Regex provides another useful construct called character classes, which are used to match a certain class of characters. The most common character classes in most regex engines are:

Character	Matches	Same as
`\\d`	any digit	`[0-9]`
`\\D`	any nondigit	`[^0-9]`
`\\w`	any character considered part of a word including the underscore character "_"	`[a-zA-Z0-9_]`
`\\W`	any character not considered part of a word	`[^a-zA-Z0-9_]`
`\\s`	any whitespace character	`[ \f\n\r\t\v]`
`\\S`	any nonwhitespace character	`[^ \f\n\r\t\v]`

So, we now have character classes as another type of metacharacter, which can also be considered shortcuts for special character sets.

str_view_all(triplets, "\\d{3}") # Any numbers of three digits

str_view_all(triplets, "\\D{3}") # Any three consecutive non-digit characters

str_view_all(triplets, "\\w{3}") # Any three consecutive word characters (letters, digits, or underscores)

str_view_all(triplets, "\\W{3}") # Any three consecutive non-word characters

str_view_all(triplets, "\\s{3}") # Any three consecutive whitespace characters

str_view_all(triplets, "\\S{3}") # Any three consecutive non-whitespace characters
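Character classes become more useful on running text than on three-character strings. The sketch below applies them to a made-up sentence (the object s is an assumption for illustration) with str_extract_all(), str_replace_all(), and str_remove_all().

```r
library(stringr)

# Made-up sentence for illustration.
s <- "BTS sold 070 tickets :-)"

# Runs of digits.
nums <- str_extract_all(s, "\\d+")[[1]]
nums
# "070"

# Runs of word characters: a very rough way to pull out words.
word_tokens <- str_extract_all(s, "\\w+")[[1]]
word_tokens
# "BTS" "sold" "070" "tickets"

# Replace every whitespace character with an underscore.
underscored <- str_replace_all(s, "\\s", "_")
underscored
# "BTS_sold_070_tickets_:-)"

# Remove everything that is not a word character.
only_word_chars <- str_remove_all(s, "\\W")
only_word_chars
# "BTSsold070tickets"
```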

Alternation

| is the alternation operator, which picks between two or more alternative patterns: the regex matches if any one of the alternatives matches.

library(stringr)
str_view_all(triplets, "\\d{3}|\\D{3}")
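One detail worth trying out is how far each alternative reaches. In the sketch below, the gray/grey words are made-up examples; they suggest that without parentheses the bar splits the whole pattern, while parentheses limit the choice to one position.

```r
library(stringr)

triplets <- c("bts", "the", "BTS", "The", "010", "070", ":-)", "^^;")

# Every element is either three digits or three non-digits.
all_match <- str_detect(triplets, "\\d{3}|\\D{3}")
all(all_match)
# TRUE

# Parentheses limit the alternation to "a" or "e".
grouped <- str_detect(c("gray", "grey", "gruy"), "gr(a|e)y")
grouped
# TRUE TRUE FALSE

# Without parentheses the alternatives are "gra" and "ey",
# so "honey" matches because it contains "ey".
ungrouped <- str_detect("honey", "gra|ey")
ungrouped
# TRUE
```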

Whitespace characters

In text pre-processing, we will encounter a variety of whitespaces that consist of different characters. Here is the table to show the characters that represent whitespaces:

Character	Description
`\f`	form feed
`\n`	line feed
`\r`	carriage return
`\t`	tab
`\v`	vertical tab

Sometimes the text contains nonprinting whitespace characters, e.g. \t, \n or \r\n. That’s why we need to use the whitespace character class \\s to match any type of whitespace character.

Please note that Windows uses \r\n as an end-of-line marker, while macOS and Linux use \n.

Form feed \f means advance downward to the next “page” or “section” as a separator. Carriage return \r is the action that returns to the beginning of the line.
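Here is a small sketch of how these characters might be handled in practice. The string raw is made up and mixes a Windows-style line ending, a tab, and repeated spaces; "\\r?\\n" is one possible pattern that covers both end-of-line styles.

```r
library(stringr)

# Made-up text with mixed whitespace: \r\n, \t, \n\n, and repeated spaces.
raw <- "line one\r\nline two\tTabbed\n\nline   three"

# Split into lines on either "\r\n" or "\n" (the "\r" is optional).
lines <- str_split(raw, "\\r?\\n")[[1]]
length(lines)
# 4

# str_squish() collapses every run of whitespace into a single space.
squished <- str_squish(raw)
squished
# "line one line two Tabbed line three"
```

The empty third element of lines comes from the two consecutive \n characters.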

POSIX (Portable Operating System Interface) character classes

Let me introduce another type of character class, known as POSIX character classes, to wrap up our work on regex. The following are the class constructs supported by the regex engine in R.

Character	Matches	Same as
`[:alnum:]`	Alphanumeric characters	`[a-zA-Z0-9]`
`[:alpha:]`	Alphabetic characters	`[a-zA-Z]`
`[:digit:]`	Digits	`[0-9]`
`[:lower:]`	Lower-case letters	`[a-z]`
`[:upper:]`	Upper-case letters	`[A-Z]`
`[:word:]`	Word characters (letters, numbers, and underscores)	`[a-zA-Z0-9_]`
`[:blank:]`	Space and tab	`[ \t]`
`[:space:]`	All whitespace characters, including line breaks	`[ \f\n\r\t\v]`
`[:punct:]`	Punctuation characters (in stringr, some symbols such as the caret ^ are not included)
`[:graph:]`	Any printable character excluding space	`[:alnum:][:punct:]`
`[:print:]`	Any printable character	`[:alnum:][:punct:][:space:]`
`[:ascii:]`	Any ASCII character (including all above)

Note that a POSIX character class is formed by an opening bracket [, followed by a colon :, followed by a keyword, followed by another colon :, and ending with a closing bracket ].

To use them in R, we have to wrap a POSIX class inside a character set. This means that we have to surround a POSIX class with another pair of brackets.

Let’s use any POSIX class to match against the vector of words triplets.

triplets

## [1] "bts" "the" "BTS" "The" "010" "070" ":-)" "^^;"

str_view_all(triplets, "[[:lower:]]{3}") # Three consecutive lower-case letters

str_view_all(triplets, "[[:alpha:]]{3}")

str_view_all(triplets, "[[:digit:]]{3}")

str_view_all(triplets, "[[:punct:]]{3}")

str_view_all(triplets, "[[:punct:]^]+[[:punct:]]") # [:punct:] does not match the literal character caret "^"

str_view_all(triplets, "[[:alpha:][:digit:][:punct:]^]+") # One or more letter/digit/punctuation/caret characters

How about using the negation metacharacter ^?

str_view_all(triplets, "[^[:alpha:][:digit:][:punct:]^]+")
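POSIX classes are convenient for the cleaning problems listed earlier in this chapter. The sketch below applies a few of them to a made-up string (the object text is an assumption for illustration), always using the double-bracket form.

```r
library(stringr)

# Made-up text for illustration.
text <- "Hello, world! (2019)"

# Remove punctuation marks.
no_punct <- str_remove_all(text, "[[:punct:]]")
no_punct
# "Hello world 2019"

# Extract runs of letters and runs of digits.
letters_only <- str_extract_all(text, "[[:alpha:]]+")[[1]]   # "Hello" "world"
digits_found <- str_extract_all(text, "[[:digit:]]+")[[1]]   # "2019"

# Negated set: remove anything that is neither alphanumeric nor whitespace.
kept <- str_replace_all(text, "[^[:alnum:][:space:]]", "")
kept
# "Hello world 2019"
```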

Package "stringr"

Let me remind you of the functions in the package stringr covered last time.

Function	Description	Similar Base Functions
`str_length()`	number of characters	`nchar()`
`str_split()`	split up a string into pieces	`strsplit()`
`str_c()`	string concatenation	`paste()`
`str_trim()`	removes leading and trailing whitespace	`trimws()`
`str_squish()`	removes any redundant whitespace	none
`str_detect()`	finds a particular pattern of characters	`grepl()`
`str_view_all()`	shows the matching result on the screen	none

All functions in stringr start with "str_" followed by a term related to the task they perform.

Useful stringr functions for pattern matching

Most stringr functions work with regex, a concise language for describing patterns of text. The following functions are useful for text pre-processing.

Function	Description
`str_which()`	Returns the indices of the elements that contain a matching pattern in a string vector
`str_subset()`	Returns all elements that contain a matching pattern in a string vector
`str_trunc()`	Truncates a string
`str_locate()`	Locates the first position of a matching pattern in a string
`str_locate_all()`	Locates all positions of a matching pattern in a string
`str_extract()`	Extracts the first matching pattern from a string
`str_extract_all()`	Extracts all matching patterns from a string
`str_replace()`	Replaces the first matching pattern in a string
`str_replace_all()`	Replaces all matching patterns in a string
`str_remove()`	Removes the first matched pattern in a string
`str_remove_all()`	Removes all matched patterns in a string
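Putting these functions together, the cleaning problems from the beginning of the tokenization section might be handled roughly as in the sketch below. The input string, the tiny stop word list stop_words_toy, and the URL pattern "https?://\\S+" are all assumptions made for illustration; a real project would need a fuller stop word list and more careful patterns.

```r
library(stringr)

# Made-up input for illustration.
raw <- "  Masks are   recommended (see https://example.org/masks) since 2019!  "

# A short, hand-picked stop word list; real lists are much longer.
stop_words_toy <- c("are", "see", "since", "the", "a", "of")

x <- str_to_lower(raw)                     # lower-case everything
x <- str_remove_all(x, "https?://\\S+")    # URLs: "http(s)://" plus non-whitespace
x <- str_remove_all(x, "[[:punct:]]")      # punctuation marks
x <- str_remove_all(x, "[[:digit:]]+")     # numbers
x <- str_squish(x)                         # leading, trailing, and repeated whitespace

# Split on single spaces, then drop the stop words.
tokens <- str_split(x, " ")[[1]]
tokens <- tokens[!tokens %in% stop_words_toy]
tokens
# "masks" "recommended"
```

The order of the steps matters here: the URL is removed before punctuation, because once the colon, slashes, and dots are gone the URL pattern can no longer recognize it.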
