Regular Expression

A regular expression is a special string for describing a certain text pattern.

Character classes

Regex provides another useful construct called character classes, which match a certain class of characters. The most common character classes in most regex engines are:

Character	Matches	Same as
`\\d`	any digit	`[0-9]`
`\\D`	any nondigit	`[^0-9]`
`\\w`	any character considered part of a word including the underscore character "_"	`[a-zA-Z0-9_]`
`\\W`	any character not considered part of a word	`[^a-zA-Z0-9_]`
`\\s`	any whitespace character	`[ \f\n\r\t\v]`
`\\S`	any nonwhitespace character	`[^ \f\n\r\t\v]`

So, we now have character classes as another type of metacharacter; they can also be considered shortcuts for special character sets.

library(stringr)
triplets <- c("bts","the","BTS","The","010","070","아이폰","휴대폰",":-)","^^;")
str_view_all(triplets, "\\d{3}") # Any three consecutive digits

str_view_all(triplets, "\\D{3}") # Any three consecutive non-digit characters

str_view_all(triplets, "\\w{3}") # Any three consecutive word characters (letters, digits, underscore)

str_view_all(triplets, "\\W{3}") # Any three consecutive non-word characters

str_view_all(triplets, "\\s{3}") # Any three consecutive whitespace characters

str_view_all(triplets, "\\S{3}") # Any three consecutive non-whitespace characters
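
The same shortcuts can also be checked without the viewer. Here is a minimal sketch that uses str_subset() to keep only the matching elements; the vector samples is a hypothetical ASCII-only stand-in for triplets, with one extra element that contains a space.

```r
library(stringr)

# Hypothetical sample vector (not from the lecture data)
samples <- c("bts", "BTS", "010", ":-)", "^^;", "a b")

digit_hits  <- str_subset(samples, "\\d{3}") # three consecutive digits
word_hits   <- str_subset(samples, "\\w{3}") # three consecutive word characters
symbol_hits <- str_subset(samples, "\\W{3}") # three consecutive non-word characters
space_hits  <- str_subset(samples, "\\s")    # at least one whitespace character

print(digit_hits)
print(word_hits)
print(symbol_hits)
print(space_hits)
```

Note that "a b" appears only in space_hits: the space in the middle breaks the run of word characters, so neither \\w{3} nor \\W{3} matches it.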

Alternation

| is the alternation operator, which matches any one of two or more alternative patterns.

str_view_all(triplets, "\\d{3}|\\D{3}")
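
As a small sketch of how alternation behaves with literal text, the example below uses two hypothetical vectors (phones and the colour words are made up for illustration). Parentheses limit how far the alternation reaches.

```r
library(stringr)

# Hypothetical sample strings
phones <- c("iPhone 15", "Galaxy S24", "Pixel 8")

# TRUE where the element contains "iPhone" or "Galaxy"
is_match <- str_detect(phones, "iPhone|Galaxy")

# Parentheses restrict the alternation to one position:
# "gr", then "a" or "e", then "y"
greys <- str_subset(c("gray", "grey", "green"), "gr(a|e)y")

print(is_match)
print(greys)
```

Without the parentheses, "gra|ey" would mean "gra" or "ey", which is a different pattern.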

Whitespace characters

In text pre-processing, we encounter several kinds of whitespace, each represented by a different character. The table below lists them:

Character	Description
`\f`	form feed
`\n`	line feed
`\r`	carriage return
`\t`	tab
`\v`	vertical tab

Sometimes the text contains nonprinting whitespace characters, e.g. \t, \n or \r\n. That’s why we use the whitespace character class \\s, which matches any type of whitespace character.

Please note that Windows uses \r\n as its end-of-line marker, while macOS and Linux use \n.

Form feed \f means advance downward to the next “page” or “section” as a separator. Carriage return \r is the action that returns to the beginning of the line.
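
To make this concrete, here is a minimal sketch of two common clean-up steps. The string raw is a hypothetical example that mixes a tab, a Windows line ending, and a Unix line ending.

```r
library(stringr)

# Hypothetical raw text: tab, then \r\n, then \n
raw <- "first\tsecond\r\nthird\nfourth"

# Replace every run of whitespace with a single space
flat <- str_replace_all(raw, "\\s+", " ")

# Split into lines; the optional \r handles both kinds of line ending
lines <- str_split(raw, "\\r?\\n")[[1]]

print(flat)
print(lines)
```

Because \\s covers all of these characters, a single pattern flattens the text no matter which operating system produced it.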

POSIX (Portable Operating System Interface) character classes

Let me introduce another type of character class, known as POSIX character classes, to wrap up our work on regex. The following are the class constructs supported by the regex engine in R.

Character	Matches	Same as
`[:alnum:]`	Alphanumeric characters	`[a-zA-Z0-9]`
`[:alpha:]`	Alphabetic characters	`[a-zA-Z]`
`[:digit:]`	Digits	`[0-9]`
`[:lower:]`	Lower-case letters	`[a-z]`
`[:upper:]`	Upper-case letters	`[A-Z]`
`[:word:]`	Word characters (letters, numbers, and underscores)	`[a-zA-Z0-9_]`
`[:blank:]`	Space and tab	`[ \t]`
`[:space:]`	All whitespace characters, including line breaks	`[ \f\n\r\t\v]`
`[:punct:]`	All punctuation and symbols
`[:graph:]`	Any printable character excluding space	`[[:alnum:][:punct:]]`
`[:print:]`	Any printable character including space	`[[:alnum:][:punct:] ]`
`[:ascii:]`	Any ASCII character (including all above)

Note that a POSIX character class is formed by an opening bracket [, followed by a colon :, followed by a keyword, followed by another colon :, and ending with a closing bracket ].

To use them in R, we have to wrap a POSIX class inside a character set. This means that we have to surround a POSIX class with another pair of brackets.
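
A short sketch of the double-bracket form follows; the string order is a hypothetical example. Several POSIX classes can share one outer pair of brackets, and the whole set can be negated with ^.

```r
library(stringr)

# Hypothetical sample string
order <- "Order #A12-b7, qty: 30"

# One POSIX class inside a character set
digits <- str_extract_all(order, "[[:digit:]]+")[[1]]

# Two POSIX classes combined in the same character set
alnum_runs <- str_extract_all(order, "[[:alpha:][:digit:]]+")[[1]]

# Negated set: remove everything that is not a digit
digits_only <- str_remove_all(order, "[^[:digit:]]")

print(digits)
print(alnum_runs)
print(digits_only)
```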

Let’s match several POSIX classes against the character vector triplets.

triplets

##  [1] "bts"    "the"    "BTS"    "The"    "010"    "070"    "아이폰" "휴대폰"
##  [9] ":-)"    "^^;"

str_view_all(triplets, "[[:lower:]]{3}") # Three consecutive lower-case letters

str_view_all(triplets, "[[:alpha:]]{3}")

str_view_all(triplets, "[[:digit:]]{3}")

str_view_all(triplets, "[[:punct:]]{3}")

str_view_all(triplets, "[[:punct:]^]+[[:punct:]]") # In stringr, [:punct:] does not match the literal caret "^", so it is listed explicitly

str_view_all(triplets, "[[:alpha:][:digit:][:punct:]^]+") # One or more consecutive letter/digit/punctuation/caret characters

How about using the negation metacharacter ^?

str_view_all(triplets, "[^[:alpha:][:digit:][:punct:]^]+")

str_view_all(triplets, "[^[:ascii:]]+")
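
The negated ASCII class is handy for finding or stripping non-English text. Below is a minimal sketch on a hypothetical vector mixed; the Korean words 아이폰 and 휴대폰 are written with \u escapes so that the code itself stays ASCII-only.

```r
library(stringr)

# Hypothetical mixed vector; the \u escapes spell the two Korean words from triplets
mixed <- c("bts", "010", "\uc544\uc774\ud3f0", "BTS \ud734\ub300\ud3f0")

# TRUE where the element contains at least one non-ASCII character
has_non_ascii <- str_detect(mixed, "[^[:ascii:]]")

# Remove every run of non-ASCII characters
ascii_only <- str_remove_all(mixed, "[^[:ascii:]]+")

print(has_non_ascii)
print(ascii_only)
```

The space in the last element is an ASCII character, so it survives the removal.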

Package "stringr"

Let me remind you of the functions in the package stringr covered last time.

Function	Description	Similar Base Functions
`str_length()`	number of characters	`nchar()`
`str_split()`	split up a string into pieces	`strsplit()`
`str_c()`	string concatenation	`paste()`
`str_trim()`	removes leading and trailing whitespace	`trimws()`
`str_squish()`	removes any redundant whitespace	none
`str_detect()`	finds a particular pattern of characters	`grepl()`
`str_view_all()`	shows the matching result on the screen	none

All functions in stringr start with "str_" followed by a term related to the task they perform.
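
As a quick refresher, here is a minimal sketch that chains several of these functions together. The string messy is a hypothetical example with extra spaces at both ends and between words.

```r
library(stringr)

# Hypothetical input with redundant whitespace
messy <- "  regular   expressions  in R  "

n_chars  <- str_length("regex")            # number of characters
trimmed  <- str_trim(messy)                # drops leading and trailing whitespace only
squished <- str_squish(messy)              # also collapses inner runs of whitespace
pieces   <- str_split(squished, " ")[[1]]  # split into words
joined   <- str_c(pieces, collapse = "_")  # glue the words back together

print(trimmed)
print(squished)
print(joined)
```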

Useful stringr functions for pattern matching

Most string functions work with regex, a concise language for describing certain patterns of text. The following functions are useful for text pre-processing.

Function	Description
`str_which()`	Returns the indices of the elements that contain a matching pattern in a string vector
`str_subset()`	Returns all elements that contain a matching pattern in a string vector
`str_trunc()`	Truncates a string
`str_locate()`	Locates the first position of a matching pattern from a string
`str_locate_all()`	Locates all positions of a matching pattern from a string
`str_extract()`	Extracts the first matching pattern from a string
`str_extract_all()`	Extracts all matching patterns from a string
`str_replace()`	Replaces the first matching pattern in a string
`str_replace_all()`	Replaces all matching patterns in a string
`str_remove()`	Removes the first matching pattern in a string
`str_remove_all()`	Removes all matching patterns in a string
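
To see how these functions differ, here is a minimal sketch that applies several of them to one hypothetical vector, contacts, which is made up for illustration.

```r
library(stringr)

# Hypothetical sample vector
contacts <- c("office 010-1234", "no number", "home 070-5678 or 010-9999")

idx      <- str_which(contacts, "\\d{3}")              # indices of matching elements
with_num <- str_subset(contacts, "\\d{3}")             # the matching elements themselves
first    <- str_extract(contacts, "\\d{3}-\\d{4}")     # first match per element, NA if none
all_nums <- str_extract_all(contacts, "\\d{3}-\\d{4}") # list with every match per element
pos      <- str_locate(contacts[1], "\\d{3}")          # start and end of the first match
masked   <- str_replace_all(contacts, "\\d", "#")      # replace every digit
no_first <- str_remove(contacts, "\\d{3}-")            # remove only the first match
short    <- str_trunc(contacts[3], 12)                 # shorten to 12 characters in total

print(idx)
print(first)
print(all_nums)
print(pos)
print(masked)
print(no_first)
print(short)
```

The functions without the _all suffix stop at the first match in each element, which is why the second number in the last element is left untouched by str_remove().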

W10-1: RWC Ch. 2 Regular expression language (Regex)

Shin Lee