Now, we are going to learn about metacharacters. The most basic type of regex is a literal character that matches itself. But not all characters match themselves. Any character that does not match itself is a metacharacter. Metacharacters have a special meaning, and they allow us to transform literal characters in very powerful ways.
Here’s the list of 15 metacharacters in regex.
.
\
|
(
)
[
]
{
}
$
-
^
*
+
?
Throughout this course, we are going to work with these metacharacters. In fact, most of what we need to know about regex is how these metacharacters work. Fortunately, there are only a few metacharacters to learn. Unfortunately, some metacharacters have more than one meaning. The meaning of a metacharacter depends on the context in which we use it, how we use it, and where we use it. So learning those meanings takes time and hours of practice.
The first metacharacter we learn about is the dot or period ".", better known as the wildcard metacharacter. This metacharacter matches ANY single character except a newline.
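For example, consider the pattern "t.e". This pattern will match not only the, but also tae, tee, tie, toe, and so on. But it will not match thee, tree, or tube as whole words, because the dot matches exactly one character. Note that it can still match a three-character piece inside a longer word, such as the "the" in thee or the "tre" in tree.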
library(stringr)
load("covid_sent_trim.RData")
covid_sent_trim
## [1] "The use of masks is recommended for those who suspect they have the virus and their caregivers Recommendations for mask use by the general public vary, with some authorities recommending against their use some recommending their use and others requiring their use"
str_view_all(covid_sent_trim, "t.e")
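To see what the dot does on individual words, a small check like the following can help. The vector `words` is a hypothetical example (it is not part of the course data), and `str_extract()` returns the first piece of each string that fits the pattern, or NA when nothing fits.

```r
library(stringr)

# Hypothetical example words, not taken from the course data
words <- c("the", "toe", "thee", "tree", "tube")

# The dot stands for exactly one character between "t" and "e"
hits <- str_extract(words, "t.e")
hits
# "the" "toe" "the" "tre" NA
```

Here "thee" and "tree" still contain a three-character piece that fits the pattern, while "tube" does not.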
The wildcard metacharacter is one of the most popular metacharacters in regex, but it is also the source of many mistakes. Let's say we want to form a regex to match "e.g". If you think that this pattern will match the letter e, followed by a dot and the letter g, you will be surprised to find out that it matches not only e.g, but also eng, e g, e-g, and so on. Why? Because "." is the metacharacter that matches any single character. This shows an important fact about regex: a pattern should match what we want, and only what we want. We want to find the thing we are looking for, but nothing more!
How can we match the character dot instead of the metacharacter, then? For instance, say we have the following character vector:
dot_words <- c("e.g", "eng", "e g", "e-g")
If we try the pattern "e.g", it will match all of the elements in dot_words.
str_view_all(dot_words, "e.g")
To actually match the dot character, we need to escape the metacharacter. In most languages, the way to escape a metacharacter is to add a backslash in front of it: "\.". When we put a backslash in front of a metacharacter, we are escaping it. This means that the character no longer has a special meaning, and it will match itself.
However, R is a bit different. Instead of a single backslash, we have to type a double backslash: "e\\.g". This is because the backslash "\" is also an escape character inside R strings, so the first backslash escapes the second, and the regex engine then receives the pattern e\.g.
str_view_all(dot_words, "e\\.g")
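The difference between the wildcard and the escaped dot can also be checked with `str_detect()`, which returns TRUE or FALSE for each element. This is a minimal sketch that rebuilds `dot_words` so it runs on its own; `fixed()` is a stringr helper that treats the whole pattern as literal text, shown here as an alternative to escaping.

```r
library(stringr)

dot_words <- c("e.g", "eng", "e g", "e-g")

# Unescaped dot: any single character may sit between "e" and "g"
with_dot <- str_detect(dot_words, "e.g")
with_dot
# TRUE TRUE TRUE TRUE

# Escaped dot: only a literal "." is accepted
escaped <- str_detect(dot_words, "e\\.g")
escaped
# TRUE FALSE FALSE FALSE

# fixed() switches off all metacharacters in the pattern
literal <- str_detect(dot_words, fixed("e.g"))

# writeLines() shows the pattern as the regex engine receives it
writeLines("e\\.g")
# e\.g
```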
So far, we have learned about metacharacters and how to escape them. Next, we will look at the opening and closing brackets [ ], which are used to define a character set.
A character set matches any one of the characters inside the set: e.g., "[abc]" will match the character “a”, “b”, or “c” in the text. The square brackets [ ] delimit the character set.
Note that the order of the characters inside the character set does NOT matter; what matters is the presence of the characters inside the brackets. So, the character set "[abc]" will match any one of the lower-case letters “c”, “b”, or “a” in the text. And "[cba]" will do the same thing.
Consider a regex pattern that includes a character set of vowels, "f[aeiou]n", and a vector with the words “fan”, “fen”, “fin”, “fon”, and “fun”:
library(stringr)
fns <- c("fan","fen","fin","fon","fun")
str_view_all(fns, "f[aeiou]n")
The pattern “f[aeiou]n” matches all elements in fns. Now let’s use the same set with another vector fnx:
fnx <- c("fan","fin","fun","f0n","f.n","f1n","fain")
str_view_all(fnx, "f[aeiou]n")
As you can see, only the first three elements of fnx, which have a single vowel between “f” and “n”, are matched. The last element “fain” is not matched: the character set matches exactly one character, either “a” or “i”, but not “ai”.
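The one-character rule can be verified with `str_detect()`. The sketch below rebuilds `fnx` so it is self-contained; the second pattern, with two character sets in a row, is added here only for contrast.

```r
library(stringr)

fnx <- c("fan", "fin", "fun", "f0n", "f.n", "f1n", "fain")

# One character set = exactly one vowel between "f" and "n"
one_vowel <- str_detect(fnx, "f[aeiou]n")
one_vowel
# TRUE TRUE TRUE FALSE FALSE FALSE FALSE

# Two character sets in a row = exactly two vowels between "f" and "n"
two_vowels <- str_detect(fnx, "f[aeiou][aeiou]n")
two_vowels
# FALSE FALSE FALSE FALSE FALSE FALSE TRUE
```

To match “fain”, the set has to appear twice, and then the single-vowel words are no longer matched.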
The character set above lists every character we want to match. But what if we want to match any letter in the English alphabet, either upper-case or lower-case, or any digit?
Character ranges help us solve this problem: we have a convenient shortcut based on the hyphen metacharacter "-" to indicate a range of characters. A character range consists of a character set with two characters separated by a hyphen "-".
So, to match any letter or number, we can define a character set formed as:
uppercase <- "[A-Z]"
lowercase <- "[a-z]"
number <- "[0-9]"
Note that the hyphen is only a metacharacter when it is inside a character set; outside the character set it is just a literal hyphen.
How, then, do we use a character range? Let’s take the following vector of three-character strings and match various types of characters.
triplets <- c("bts","the","BTS","The","010","070",":-)","^^;")
str_view_all(triplets, "[a-z][a-z][a-z]") # any three consecutive lower-case letters
str_view_all(triplets, "[A-Z][A-Z][A-Z]") # any three consecutive upper-case letters
str_view_all(triplets, "[A-Z][a-z][a-z]") # any upper case letter first, followed by any two lower-case letters
str_view_all(triplets, "[0-9][0-9][0-9]") # any numbers with three consecutive digits
Note that the elements ":-)" and "^^;" are not matched by any of the character ranges that we have seen so far.
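As a quick check of the ranges above, `str_subset()` returns only the elements that contain a match. This is a small sketch that rebuilds `triplets` so it can be run on its own.

```r
library(stringr)

triplets <- c("bts", "the", "BTS", "The", "010", "070", ":-)", "^^;")

# Three lower-case letters in a row
lower3 <- str_subset(triplets, "[a-z][a-z][a-z]")
lower3
# "bts" "the"

# One upper-case letter followed by two lower-case letters
title3 <- str_subset(triplets, "[A-Z][a-z][a-z]")
title3
# "The"

# Three digits in a row
digit3 <- str_subset(triplets, "[0-9][0-9][0-9]")
digit3
# "010" "070"
```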
We can control how many times a pattern matches with the repetition operators:
{n}: exactly n times
{n,}: n times or more
{n,m}: between n and m times
?: 0 or 1 time
+: 1 or more times
*: 0 or more times
str_view_all(triplets, "[a-z]{3}") # any three consecutive lower-case letters
str_view_all(triplets, "[A-Z]{2,}") # two or more consecutive upper-case letters
str_view_all(triplets, "[A-Z][a-z]+") # an upper-case letter followed by one or more lower-case letters
str_view_all(triplets, "[0-9]+") # one or more consecutive digits
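The repetition operators are easiest to compare side by side on the same input. The vector `words` below is a hypothetical example chosen so that the letter “u” appears zero, one, or two times.

```r
library(stringr)

# Hypothetical example: "u" appears 0, 1, and 2 times
words <- c("color", "colour", "colouur")

opt  <- str_detect(words, "colou?r")    # "u" zero or one time
plus <- str_detect(words, "colou+r")    # "u" one or more times
star <- str_detect(words, "colou*r")    # "u" zero or more times
two  <- str_detect(words, "colou{2}r")  # "u" exactly two times

opt   # TRUE TRUE FALSE
plus  # FALSE TRUE TRUE
star  # TRUE TRUE TRUE
two   # FALSE FALSE TRUE

# {n,m} takes as many repetitions as it can, up to m
greedy <- str_extract("aaaa", "a{2,3}")
greedy
# "aaa"
```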
When working with regex, we often need to match characters that are NOT part of a certain set. For example, we may want to match any character that is not a letter of the alphabet. This type of matching can be done with a negative character set, which matches any one character that is not in the set. To define this type of set, we use the caret metacharacter "^".
The caret "^" is one of the metacharacters that have more than one meaning, depending on where it appears in a regex pattern. If we use a caret in the first position inside a character set, e.g. "[^a-z]", it means negation: “any one character that is not one of the following lower-case letters.” So it matches anything except lower-case letters.
So, we can match the elements ":-)" and "^^;", which contain neither letters nor numbers, by defining a negative character range "[^a-zA-Z0-9]":
str_view_all(triplets, "[^a-zA-Z0-9]{3}") # three consecutive negations of letters & digits
It is important to note that the caret means negation only when it comes first inside the character set; otherwise the set is not a negative one:
str_view_all(triplets, "[a-zA-Z0-9^]{3}") # three consecutive letters/numbers/caret
In this case, the pattern "[a-zA-Z0-9^]" means “any one letter or number or caret character,” which is completely different from the negative set "[^a-zA-Z0-9]" that negates any one letter/number.
How can we match the literal character ^ in the last element of triplets without a character set? Use double backslashes!
If we want to match any character except the caret, then we need a character set with two carets: "[^^]". The first caret works as the negation operator, and the second caret is the caret character itself:
str_view_all(triplets, "\\^\\^;")
str_view_all(triplets, "[^^][^^][^^]") # three consecutive negations of caret
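The position of the caret changes which elements are matched, which can be confirmed with `str_subset()`. The sketch below rebuilds `triplets` so it is self-contained.

```r
library(stringr)

triplets <- c("bts", "the", "BTS", "The", "010", "070", ":-)", "^^;")

# Caret first: negation, three characters that are neither letters nor digits
negated <- str_subset(triplets, "[^a-zA-Z0-9]{3}")
negated
# ":-)" "^^;"

# Caret last: a literal caret is simply one more member of the set
caret_member <- str_subset(triplets, "[a-zA-Z0-9^]{3}")
caret_member
# "bts" "the" "BTS" "The" "010" "070"

# Outside a set, the caret is escaped with double backslashes
literal_caret <- str_subset(triplets, "\\^\\^;")
literal_caret
# "^^;"
```

The element "^^;" is not in the second result because its last character, the semicolon, is not a member of the set.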
Now we know what character sets are, how to define character ranges, and how to specify negative character sets. From now on, let’s talk about what happens when including metacharacters inside character sets.
Except for the caret in the first position, most metacharacters inside a character set are already ESCAPED. This means that we do not need to escape them with double backslashes inside the character set.
Consider the vector of words fnx for example. A regex with the character set formed by "f[.aiu]n" includes the dot character. And remember that the dot character is a metacharacter, in general, which matches any type of character. However, when the dot character is inside a character set, it loses its function as a metacharacter. So the character set only matches letters “a”, “i”, “u”, and the literal dot character “.” between “f” and “n”.
fnx
## [1] "fan" "fin" "fun" "f0n" "f.n" "f1n" "fain"
str_view_all(fnx, "f[.aiu]n") # "f", then one of "a"/"i"/"u" or a literal dot, then "n"
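Comparing the dot outside and inside a character set makes the difference visible. This is a minimal sketch that rebuilds `fnx` and uses `str_subset()` to list the matching elements.

```r
library(stringr)

fnx <- c("fan", "fin", "fun", "f0n", "f.n", "f1n", "fain")

# Outside a set, the dot is a wildcard for any single character
wildcard <- str_subset(fnx, "f.n")
wildcard
# "fan" "fin" "fun" "f0n" "f.n" "f1n"

# Inside a set, the dot is just a literal dot
in_set <- str_subset(fnx, "f[.aiu]n")
in_set
# "fan" "fin" "fun" "f.n"

# A set holding only the dot is another way to match a literal dot
only_dot <- str_subset(fnx, "f[.]n")
only_dot
# "f.n"
```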
Unfortunately, not all metacharacters become literal characters when they are inside a character set. There are some exceptions: the closing bracket ] and the hyphen -, as well as the caret ^.
The closing bracket ] is used to close the character set. So, when we want to use a literal closing bracket inside a character set, we should escape it using double backslashes: "[aiu\\]]".
As we’ve already seen, the hyphen character - is used to define a range of characters inside a character set, e.g. [a-d] and [0-5]. To match a literal hyphen inside a character set, we escape it in the same way: "[a\\-z]" matches “a”, “-”, or “z”.
escape <- c("f^n","f]n","f-n") # strings containing a literal caret, closing bracket, and hyphen
str_view_all(escape, "f[\\]\\-\\^]n") # inside the set, each of these metacharacters is escaped with double backslashes
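The effect of escaping the hyphen can be seen by comparing a range with an escaped set. The vector `chars` is a hypothetical example; the escaped forms below rely on the regex engine used by stringr, which accepts backslash escapes inside a character set.

```r
library(stringr)

escape <- c("f^n", "f]n", "f-n")

# All three special characters are escaped inside the set
all_three <- str_detect(escape, "f[\\]\\-\\^]n")
all_three
# TRUE TRUE TRUE

# Hypothetical single-character strings for comparing the two sets
chars <- c("a", "-", "z", "m")

# Unescaped hyphen: a range from "a" to "z"
as_range <- str_detect(chars, "[a-z]")
as_range
# TRUE FALSE TRUE TRUE

# Escaped hyphen: only "a", "-", or "z"
as_literal <- str_detect(chars, "[a\\-z]")
as_literal
# TRUE TRUE TRUE FALSE
```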
Regex provides another useful construct called character classes, which are used to match a certain class of characters. The most common character classes in most regex engines are:
Character | Matches | Same as |
---|---|---|
\\d | any digit | [0-9] |
\\D | any nondigit | [^0-9] |
\\w | any character considered part of a word, including the underscore character "_" | [a-zA-Z0-9_] |
\\W | any character not considered part of a word | [^a-zA-Z0-9_] |
\\s | any whitespace character | [ \f\n\r\t\v] |
\\S | any nonwhitespace character | [^ \f\n\r\t\v] |
So, we now have character classes as another type of metacharacters that can be also considered shortcuts for special character sets.
str_view_all(triplets, "\\d{3}") # Any numbers of three digits
str_view_all(triplets, "\\D{3}") # Any three consecutive non-digit characters
str_view_all(triplets, "\\w{3}") # Any three consecutive word characters (letters, digits, or underscore)
str_view_all(triplets, "\\W{3}") # Any three consecutive non-word characters
str_view_all(triplets, "\\s{3}") # Any three consecutive whitespace characters
str_view_all(triplets, "\\S{3}") # Any three consecutive non-whitespace characters
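The same classes can be checked with `str_subset()` to list which elements each one picks out. This sketch rebuilds `triplets` so that it runs on its own.

```r
library(stringr)

triplets <- c("bts", "the", "BTS", "The", "010", "070", ":-)", "^^;")

digits3  <- str_subset(triplets, "\\d{3}")  # three digits
word3    <- str_subset(triplets, "\\w{3}")  # three word characters
nonword3 <- str_subset(triplets, "\\W{3}")  # three non-word characters
space3   <- str_subset(triplets, "\\s{3}")  # three whitespace characters

digits3   # "010" "070"
word3     # "bts" "the" "BTS" "The" "010" "070"
nonword3  # ":-)" "^^;"
space3    # character(0): no element contains whitespace
```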
| is the alternation operator, which matches either the pattern on its left or the pattern on its right.
library(stringr)
str_view_all(triplets, "\\d{3}|\\D{3}")
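A short sketch of alternation with `str_detect()` follows. The vector `pets` is a hypothetical example, and `triplets` is rebuilt so the code is self-contained.

```r
library(stringr)

triplets <- c("bts", "the", "BTS", "The", "010", "070", ":-)", "^^;")

# Every element is either three digits or three non-digits
either <- str_detect(triplets, "\\d{3}|\\D{3}")
all(either)
# TRUE

# Hypothetical example: match "cat" or "dog", but not "bird"
pets <- c("cat", "dog", "bird")
cat_or_dog <- str_detect(pets, "cat|dog")
cat_or_dog
# TRUE TRUE FALSE
```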
In text pre-processing, we will encounter a variety of whitespaces that consist of different characters. Here is the table to show the characters that represent whitespaces:
Character | Description |
---|---|
\f | form feed |
\n | line feed |
\r | carriage return |
\t | tab |
\v | vertical tab |
Sometimes the text contains nonprinting whitespace characters, e.g. \t, \n, or \r\n. That’s why we need the whitespace character class \\s to match any type of whitespace character.
Please note that Windows uses \r\n as its end-of-line marker, while macOS uses \n.
Form feed \f advances downward to the next “page” or “section” and acts as a separator. Carriage return \r returns to the beginning of the line.
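To see why \\s is convenient in pre-processing, consider a string that mixes several kinds of whitespace. The string `txt` below is a made-up example; `str_replace_all()` and `str_squish()` are stringr functions listed later in this section.

```r
library(stringr)

# Made-up text mixing a Windows line ending, a tab, a newline, and spaces
txt <- "line one\r\nline\ttwo\n  line three  "

# Replace every run of whitespace with a single space
one_space <- str_replace_all(txt, "\\s+", " ")
one_space
# "line one line two line three "

# str_squish() does the same and also trims both ends
squished <- str_squish(txt)
squished
# "line one line two line three"
```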
Let me introduce another type of character class, known as POSIX character classes, to wrap up our work on regex. The following are the class constructs supported by the regex engine in R.
Character | Matches | Same as |
---|---|---|
[:alnum:] | Alphanumeric characters | [a-zA-Z0-9] |
[:alpha:] | Alphabetic characters | [a-zA-Z] |
[:digit:] | Digits | [0-9] |
[:lower:] | Lower-case letters | [a-z] |
[:upper:] | Upper-case letters | [A-Z] |
[:word:] | Word characters (letters, numbers, and underscores) | [a-zA-Z0-9_] |
[:blank:] | Space and tab | [ \t] |
[:space:] | All whitespace characters, including line breaks | [ \f\n\r\t\v] |
[:punct:] | All punctuation and symbols | |
[:graph:] | Any printable character excluding space | [:alnum:][:punct:] |
[:print:] | Any printable character | [:alnum:][:punct:][:space:] |
[:ascii:] | Any ASCII character (including all above) | |
Note that a POSIX character class is formed by an opening bracket [, followed by a colon :, followed by a keyword, followed by another colon :, and ending with a closing bracket ].
To use them in R, we have to wrap a POSIX class inside a character set. This means that we have to surround a POSIX class with another pair of brackets.
Let’s use some POSIX classes to match against the vector of words triplets.
triplets
## [1] "bts" "the" "BTS" "The" "010" "070" ":-)" "^^;"
str_view_all(triplets, "[[:lower:]]{3}") # Three consecutive lower-case letters
str_view_all(triplets, "[[:alpha:]]{3}")
str_view_all(triplets, "[[:digit:]]{3}")
str_view_all(triplets, "[[:punct:]]{3}")
str_view_all(triplets, "[[:punct:]^]+[[:punct:]]") # [:punct:] does not match the literal character caret "^"
str_view_all(triplets, "[[:alpha:][:digit:][:punct:]^]+") # One or more letter/digit/punctuation/caret characters
How about using the negation metacharacter ^?
str_view_all(triplets, "[^[:alpha:][:digit:][:punct:]^]+")
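The double-bracket form can be checked with `str_subset()`. This sketch sticks to the letter and digit classes, whose behavior is the least engine-dependent; it rebuilds `triplets` so that it is self-contained.

```r
library(stringr)

triplets <- c("bts", "the", "BTS", "The", "010", "070", ":-)", "^^;")

# The POSIX class sits inside an outer pair of brackets
lower3 <- str_subset(triplets, "[[:lower:]]{3}")
alpha3 <- str_subset(triplets, "[[:alpha:]]{3}")
digit3 <- str_subset(triplets, "[[:digit:]]{3}")

lower3  # "bts" "the"
alpha3  # "bts" "the" "BTS" "The"
digit3  # "010" "070"

# A leading caret negates the classes listed after it
other <- str_subset(triplets, "[^[:alpha:][:digit:]]+")
other
# ":-)" "^^;"
```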
Let me remind you of the functions in the package stringr covered last time.
Function | Description | Similar Base Functions |
---|---|---|
str_length() | number of characters | nchar() |
str_split() | split up a string into pieces | strsplit() |
str_c() | string concatenation | paste() |
str_trim() | removes leading and trailing whitespace | trimws() |
str_squish() | removes any redundant whitespace | |
str_detect() | finds a particular pattern of characters | |
str_view_all() | shows the matching result on the screen | |
All functions in stringr start with "str_", followed by a term related to the task they perform.
Most string functions work with regex, a concise language for describing patterns of text. The following functions are useful for text pre-processing.
Function | Description |
---|---|
str_which() | Returns all positions of a matching pattern in a string vector |
str_subset() | Returns all elements that contain a matching pattern in a string vector |
str_trunc() | Truncates a string |
str_locate() | Locates the first position of a matching pattern in a string |
str_locate_all() | Locates all positions of a matching pattern in a string |
str_extract() | Extracts the first matching pattern from a string |
str_extract_all() | Extracts all matching patterns from a string |
str_replace() | Replaces the first matching pattern in a string |
str_replace_all() | Replaces all matching patterns in a string |
str_remove() | Removes the first matching pattern in a string |
str_remove_all() | Removes all matching patterns in a string |
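Several of these functions can be tried on one small vector to see how the "first match" and "all matches" variants differ. The vector `x` is a hypothetical example, and the pattern "[0-9]+" (one or more digits) is used throughout.

```r
library(stringr)

# Hypothetical example strings with zero, one, and two runs of digits
x <- c("fan-01", "fin", "fun-22-33")

pos   <- str_which(x, "[0-9]+")             # positions of matching elements
kept  <- str_subset(x, "[0-9]+")            # the matching elements themselves
first <- str_extract(x, "[0-9]+")           # first match per element (NA if none)
every <- str_extract_all(x, "[0-9]+")       # list with all matches per element
rep1  <- str_replace(x, "[0-9]+", "#")      # replace the first match only
repa  <- str_replace_all(x, "[0-9]+", "#")  # replace every match
gone  <- str_remove_all(x, "-[0-9]+")       # delete every match

pos    # 1 3
kept   # "fan-01" "fun-22-33"
first  # "01" NA "22"
rep1   # "fan-#" "fin" "fun-#-33"
repa   # "fan-#" "fin" "fun-#-#"
gone   # "fan" "fin" "fun"
```

The result of `str_extract_all()` is a list, so `every[[3]]` holds both "22" and "33".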