Regular Expression

A regular expression is a special string for describing a certain text pattern.

Character classes

Regex provides another useful construct called character classes, which match a certain class of characters. The most common character classes in most regex engines are:

Character   Matches                                                 Same as
\\d         any digit                                               [0-9]
\\D         any nondigit                                            [^0-9]
\\w         any word character (letter, digit, or underscore "_")   [a-zA-Z0-9_]
\\W         any nonword character                                   [^a-zA-Z0-9_]
\\s         any whitespace character                                [ \f\n\r\t\v]
\\S         any nonwhitespace character                             [^ \f\n\r\t\v]

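So, character classes are another type of metacharacter, and each one can be read as a shortcut for a particular character set. The "Same as" column gives the ASCII equivalents; Unicode-aware engines such as the one behind stringr also treat non-ASCII letters and digits as word characters.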

library(stringr)
triplets <- c("bts","the","BTS","The","010","070","아이폰","휴대폰",":-)","^^;")
str_view_all(triplets, "\\d{3}") # Any three consecutive digits
str_view_all(triplets, "\\D{3}") # Any three consecutive non-digit characters
str_view_all(triplets, "\\w{3}") # Any three consecutive word characters (letter, digit, or underscore)
str_view_all(triplets, "\\W{3}") # Any three consecutive non-word characters
str_view_all(triplets, "\\s{3}") # Any three consecutive whitespace characters
str_view_all(triplets, "\\S{3}") # Any three consecutive non-whitespace characters
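
To check the matches programmatically rather than visually, a minimal sketch along the following lines may help. It assumes an ASCII-only copy of triplets (the name ascii_triplets is made up for this illustration) and uses str_detect() and str_subset() from stringr.

```r
library(stringr)

# Illustrative ASCII-only subset of the triplets vector (an assumption of this sketch)
ascii_triplets <- c("bts", "the", "BTS", "The", "010", "070", ":-)", "^^;")

# The shorthand \\d and the explicit set [0-9] flag the same elements here
digits_shorthand <- str_detect(ascii_triplets, "\\d{3}")
digits_explicit  <- str_detect(ascii_triplets, "[0-9]{3}")
identical(digits_shorthand, digits_explicit)  # TRUE

# Keep only the elements that contain three consecutive digits
three_digits <- str_subset(ascii_triplets, "\\d{3}")
three_digits  # "010" "070"

# No element contains three consecutive whitespace characters
any_triple_space <- any(str_detect(ascii_triplets, "\\s{3}"))
any_triple_space  # FALSE
```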

Alternation

| is the alternation operator: it matches either the pattern on its left or the pattern on its right.

str_view_all(triplets, "\\d{3}|\\D{3}")
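
As a small sketch of how alternation behaves, the example below uses a made-up vector named samples (not part of the original material) and shows that parentheses can limit the alternation to part of a pattern.

```r
library(stringr)

# Illustrative input: all digits, all letters, and a mix of both
samples <- c("010", "bts", "a1b")

# Either three digits in a row or three non-digits in a row
either <- str_detect(samples, "\\d{3}|\\D{3}")
either  # TRUE TRUE FALSE

# Parentheses limit the alternation to part of the pattern
spelling <- str_detect(c("gray", "grey", "griy"), "gr(a|e)y")
spelling  # TRUE TRUE FALSE
```

The element "a1b" fails both alternatives because it contains neither three consecutive digits nor three consecutive non-digits.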

Whitespace characters

In text pre-processing, we encounter several kinds of whitespace, each represented by a different character. The table below lists these characters:

Character   Description
\f          form feed
\n          line feed
\r          carriage return
\t          tab
\v          vertical tab

Text often contains nonprinting whitespace characters, e.g. \t, \n or \r\n. That’s why we use the whitespace character class \\s, which matches any type of whitespace character.

Please note that Windows uses \r\n as its end-of-line marker, while macOS and Linux use \n.

Form feed \f advances to the next “page” or “section” and acts as a separator. Carriage return \r returns to the beginning of the current line.
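
The following sketch shows one way to handle mixed whitespace with \\s. The input string raw_text is invented for illustration; the functions are standard stringr calls.

```r
library(stringr)

# Illustrative text mixing Windows (\r\n) and Unix (\n) line endings and a tab
raw_text <- "first line\r\nsecond\tline\n  third   line  "

# Split into lines regardless of which end-of-line marker was used
text_lines <- str_split(raw_text, "\r?\n")[[1]]
length(text_lines)  # 3

# Replace every run of whitespace with a single space
collapsed <- str_replace_all(raw_text, "\\s+", " ")
collapsed  # "first line second line third line "

# str_squish() does the same and also trims both ends
squished <- str_squish(raw_text)
squished  # "first line second line third line"
```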

POSIX (Portable Operating System Interface) character classes

Let me introduce another type of character class, known as POSIX character classes, to wrap up our work on regex. The following class constructs are supported by the regex engine in R.

Character   Matches                                                Same as
[:alnum:]   Alphanumeric characters                                [a-zA-Z0-9]
[:alpha:]   Alphabetic characters                                  [a-zA-Z]
[:digit:]   Digits                                                 [0-9]
[:lower:]   Lower-case letters                                     [a-z]
[:upper:]   Upper-case letters                                     [A-Z]
[:word:]    Word characters (letters, numbers, and underscores)    [a-zA-Z0-9_]
[:blank:]   Space and tab                                          [ \t]
[:space:]   All whitespace characters, including line breaks       [ \f\n\r\t\v]
[:punct:]   Punctuation characters (with stringr, some symbols such as the caret ^ are not matched)
[:graph:]   Any printable character excluding space                [:alnum:][:punct:]
[:print:]   Any printable character                                [:alnum:][:punct:][:space:]
[:ascii:]   Any ASCII character (including all of the above)

Note that a POSIX character class is formed by an opening bracket [, followed by a colon :, followed by a keyword, followed by another colon :, and ending with a closing bracket ].

To use them in R, we have to wrap a POSIX class inside a character set. This means that we have to surround a POSIX class with another pair of brackets.

Let’s match several POSIX classes against the vector triplets.

triplets
##  [1] "bts"    "the"    "BTS"    "The"    "010"    "070"    "아이폰" "휴대폰"
##  [9] ":-)"    "^^;"
str_view_all(triplets, "[[:lower:]]{3}") # Three consecutive lower-case letters
str_view_all(triplets, "[[:alpha:]]{3}") # Three consecutive letters
str_view_all(triplets, "[[:digit:]]{3}") # Three consecutive digits
str_view_all(triplets, "[[:punct:]]{3}") # Three consecutive punctuation characters
str_view_all(triplets, "[[:punct:]^]+[[:punct:]]") # [:punct:] does not match the caret "^", so it is added to the set as a literal
str_view_all(triplets, "[[:alpha:][:digit:][:punct:]^]+") # One or more letter/digit/punctuation/caret characters
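
To see the results as logical values instead of highlighted matches, a sketch such as the following could be used. The vector words and the string "ID: ab12!" are made up for this example.

```r
library(stringr)

# Illustrative ASCII-only input
words <- c("bts", "BTS", "The", "010")

lower3 <- str_detect(words, "[[:lower:]]{3}")
lower3  # TRUE FALSE FALSE FALSE

alpha3 <- str_detect(words, "[[:alpha:]]{3}")
alpha3  # TRUE TRUE TRUE FALSE

digit3 <- str_detect(words, "[[:digit:]]{3}")
digit3  # FALSE FALSE FALSE TRUE

# Several POSIX classes can share one character set
first_run <- str_extract("ID: ab12!", "[[:alpha:][:digit:]]+")
first_run  # "ID"
```

Note that "The" fails the [[:lower:]]{3} test because only two of its letters are lower-case.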

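How about using the negation metacharacter ^? When the caret is the first character inside a set, it negates the set; in any other position it is a literal caret.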

str_view_all(triplets, "[^[:alpha:][:digit:][:punct:]^]+")
str_view_all(triplets, "[^[:ascii:]]+")
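
The role of the caret's position can be sketched as follows. The input strings are invented for illustration only.

```r
library(stringr)

# Leading caret: remove everything that is NOT a letter or a digit
cleaned <- str_remove_all("R2-D2 & C-3PO!", "[^[:alpha:][:digit:]]")
cleaned  # "R2D2C3PO"

# A caret that is not in first position is a literal member of the set
digit_or_caret_5     <- str_detect("5", "[[:digit:]^]")  # TRUE: "5" is a digit
digit_or_caret_caret <- str_detect("^", "[[:digit:]^]")  # TRUE: "^" is listed literally

# A leading caret negates the set, so a digit no longer matches
not_digit_5 <- str_detect("5", "[^[:digit:]]")           # FALSE
```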

Package "stringr"

Let me remind you of the functions in the package stringr covered last time.

Function         Description                                                   Similar base function
str_length()     number of characters                                          nchar()
str_split()      splits a string into pieces                                   strsplit()
str_c()          string concatenation                                          paste()
str_trim()       removes leading and trailing whitespace                       trimws()
str_squish()     removes leading, trailing, and repeated internal whitespace   none
str_detect()     tests whether a pattern occurs in a string                    grepl()
str_view_all()   displays the matches on screen                                none

All functions in stringr start with "str_", followed by a term related to the task they perform.
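
As a quick refresher, the sketch below calls each of these functions on small, made-up inputs.

```r
library(stringr)

n_chars  <- str_length("stringr")               # 7
pieces   <- str_split("a,b,c", ",")[[1]]        # "a" "b" "c"
joined   <- str_c("text", "mining", sep = " ")  # "text mining"
trimmed  <- str_trim("  text mining  ")         # "text mining"
squished <- str_squish("  text    mining  ")    # "text mining"
has_num  <- str_detect(c("bts", "010"), "\\d")  # FALSE TRUE
```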

Useful stringr functions for pattern matching

Most string functions work with regex, a concise language for describing patterns of text. The following functions are useful for text pre-processing.

Function            Description
str_which()         Returns the indices of the elements of a string vector that contain a match
str_subset()        Returns the elements of a string vector that contain a match
str_trunc()         Truncates a string to a given width
str_locate()        Locates the start and end position of the first match in a string
str_locate_all()    Locates the start and end positions of all matches in a string
str_extract()       Extracts the first match from a string
str_extract_all()   Extracts all matches from a string
str_replace()       Replaces the first match in a string
str_replace_all()   Replaces all matches in a string
str_remove()        Removes the first match from a string
str_remove_all()    Removes all matches from a string
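
The sketch below exercises these functions on a small vector. The vector contacts, its phone numbers, and the phone-number pattern are all made up for illustration; the pattern is not meant to validate real numbers.

```r
library(stringr)

# Hypothetical input with made-up phone numbers
contacts <- c("tel: 010-1234-5678", "no number", "fax: 070-1111-2222, 02-333-4444")
phone <- "\\d{2,3}-\\d{3,4}-\\d{4}"  # illustrative pattern only

idx      <- str_which(contacts, "\\d")          # 1 3
with_num <- str_subset(contacts, "\\d")         # elements 1 and 3
first    <- str_extract(contacts, phone)        # "010-1234-5678" NA "070-1111-2222"
all_3    <- str_extract_all(contacts[3], phone)[[1]]  # "070-1111-2222" "02-333-4444"

# str_locate() returns a matrix with "start" and "end" columns
loc <- str_locate(contacts[1], "\\d+")          # start 6, end 8

masked_one <- str_replace(contacts[1], "\\d", "#")      # "tel: #10-1234-5678"
masked_all <- str_replace_all(contacts[1], "\\d", "#")  # "tel: ###-####-####"
dropped    <- str_remove(contacts[1], "\\d")            # "tel: 10-1234-5678"
digits     <- str_remove_all(contacts[1], "[^[:digit:]]")  # "01012345678"

# Truncate to 10 characters, including the trailing "..."
short <- str_trunc("regular expression", 10)    # "regular..."
```

The functions without the _all suffix stop at the first match in each string, while the _all variants process every match.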