A regular expression is a special string for describing a certain text pattern.
Regex provides another useful construct called character classes, which match a certain class of characters. The most common character classes in most regex engines are:
Character | Matches | Same as |
---|---|---|
\\d | any digit | [0-9] |
\\D | any nondigit | [^0-9] |
\\w | any character considered part of a word, including the underscore character "_" | [a-zA-Z0-9_] |
\\W | any character not considered part of a word | [^a-zA-Z0-9_] |
\\s | any whitespace character | [ \f\n\r\t\v] |
\\S | any nonwhitespace character | [^ \f\n\r\t\v] |
So character classes are another type of metacharacter, and they can also be regarded as shortcuts for commonly used character sets.
library(stringr)
triplets <- c("bts","the","BTS","The","010","070","아이폰","휴대폰",":-)","^^;")
str_view_all(triplets, "\\d{3}") # Any three consecutive digits
str_view_all(triplets, "\\D{3}") # Any three consecutive non-digit characters
str_view_all(triplets, "\\w{3}") # Any three consecutive word characters (letters, digits, underscore)
str_view_all(triplets, "\\W{3}") # Any three consecutive non-word characters
str_view_all(triplets, "\\s{3}") # Any three consecutive whitespace characters
str_view_all(triplets, "\\S{3}") # Any three consecutive non-whitespace characters
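To make the table above more concrete, here is a minimal sketch that lists which strings each character class matches. The vector samples is a hypothetical example made up for this sketch, and str_subset() is used instead of str_view_all() so that the result is a plain character vector. As far as I know, the regex engine behind stringr also treats letters from other scripts, such as Hangul, as word characters, so the bracket equivalents in the table are best read as ASCII approximations.

```r
library(stringr)

# Hypothetical sample vector, made up for this sketch
samples <- c("abc", "A_1", "123", "a b", ":-)", "   ")

digits3    <- str_subset(samples, "\\d{3}") # "123"
nondigits3 <- str_subset(samples, "\\D{3}") # "abc" "a b" ":-)" "   "
word3      <- str_subset(samples, "\\w{3}") # "abc" "A_1" "123"
nonword3   <- str_subset(samples, "\\W{3}") # ":-)" "   "
space3     <- str_subset(samples, "\\s{3}") # "   "
nonspace3  <- str_subset(samples, "\\S{3}") # "abc" "A_1" "123" ":-)"
```

Note that "A_1" counts as three word characters because the underscore belongs to \\w, while "a b" fails \\w{3} because the space breaks the run.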
The pipe character | is the alternation operator, which matches any one of the alternative patterns it separates.
str_view_all(triplets, "\\d{3}|\\D{3}")
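The following sketch shows the same alternation with str_detect() and str_extract(), so that the outcome can be read as values. The vector codes is a hypothetical example. Because alternation has the lowest precedence among regex operators, the alternatives are wrapped in parentheses before anchors are added.

```r
library(stringr)

# Hypothetical sample vector, made up for this sketch
codes <- c("010", "bts", "a1b", "12")

# TRUE when the whole string is either three digits or three non-digits
whole <- str_detect(codes, "^(\\d{3}|\\D{3})$") # TRUE TRUE FALSE FALSE

# First substring that is three digits or three non-digits, NA if there is none
first <- str_extract(codes, "\\d{3}|\\D{3}")    # "010" "bts" NA NA
```

Without the parentheses, "^\\d{3}|\\D{3}$" would be read as "three digits at the start" or "three non-digits at the end", which is a different pattern.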
In text pre-processing, we encounter whitespace made up of several different characters. The following table lists the characters that represent whitespace:
Character | Description |
---|---|
\f | form feed |
\n | line feed |
\r | carriage return |
\t | tab |
\v | vertical tab |
Sometimes the text contains nonprinting whitespace characters, e.g. \t, \n or \r\n. That's why we need the whitespace character class \\s, which matches any type of whitespace character.
Please note that Windows uses \r\n as its end-of-line marker, while macOS and Linux use \n.
Form feed \f means advancing downward to the next “page” or “section”, so it acts as a separator. Carriage return \r is the action that returns to the beginning of the line.
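As a small illustration of why \\s is handy, the sketch below counts and cleans up mixed whitespace. The string raw is a hypothetical example that combines tabs, a Windows line ending, and a Unix line ending.

```r
library(stringr)

# Hypothetical raw text: two tabs, one \r\n and one \n
raw <- "name\tage\r\nKim\t30\n"

n_space <- str_count(raw, "\\s")                 # 5 whitespace characters
unix    <- str_replace_all(raw, "\\r\\n", "\n")  # turn \r\n into \n
flat    <- str_squish(raw)                       # "name age Kim 30"
```

str_count() treats \r and \n as two separate whitespace characters, while str_squish() collapses every run of whitespace into a single space and trims both ends.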
Let me introduce another type of character class, known as POSIX character classes, to wrap up our work on regex. The following class constructs are supported by the regex engine in R.
Character | Matches | Same as |
---|---|---|
[:alnum:] | Alphanumeric characters | [a-zA-Z0-9] |
[:alpha:] | Alphabetic characters | [a-zA-Z] |
[:digit:] | Digits | [0-9] |
[:lower:] | Lower-case letters | [a-z] |
[:upper:] | Upper-case letters | [A-Z] |
[:word:] | Word characters (letters, numbers, and underscores) | [a-zA-Z0-9_] |
[:blank:] | Space and tab | [ \t] |
[:space:] | All whitespace characters, including line breaks | [ \f\n\r\t\v] |
[:punct:] | All punctuation and symbols | |
[:graph:] | Any printable character excluding space | [:alnum:][:punct:] |
[:print:] | Any printable character | [:alnum:][:punct:][:space:] |
[:ascii:] | Any ASCII character (including all above) | |
Note that a POSIX character class is formed by an opening bracket [, followed by a colon :, followed by a keyword, followed by another colon :, and ending with a closing bracket ].
To use them in R, we have to wrap a POSIX class inside a character set. This means that we have to surround a POSIX class with another pair of brackets.
Let’s use some POSIX classes to match against the vector of words triplets.
triplets
## [1] "bts" "the" "BTS" "The" "010" "070" "아이폰" "휴대폰"
## [9] ":-)" "^^;"
str_view_all(triplets, "[[:lower:]]{3}") # Three consecutive lower-case letters
str_view_all(triplets, "[[:alpha:]]{3}") # Three consecutive letters
str_view_all(triplets, "[[:digit:]]{3}") # Three consecutive digits
str_view_all(triplets, "[[:punct:]]{3}") # Three consecutive punctuation characters
str_view_all(triplets, "[[:punct:]^]+[[:punct:]]") # [:punct:] does not match the literal character caret "^"
str_view_all(triplets, "[[:alpha:][:digit:][:punct:]^]+") # One or more consecutive letter/digit/punctuation/caret characters
How about using the negation metacharacter ^?
str_view_all(triplets, "[^[:alpha:][:digit:][:punct:]^]+")
str_view_all(triplets, "[^[:ascii:]]+")
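The sketch below shows how POSIX classes behave inside a character set, with and without negation. The vector tokens is a hypothetical ASCII-only example, so the results do not depend on how the engine classifies non-ASCII characters.

```r
library(stringr)

# Hypothetical sample vector, made up for this sketch
tokens <- c("abc", "ABC", "a1!", "42", "?!")

only_lower <- str_detect(tokens, "^[[:lower:]]+$")  # only lower-case letters
only_alnum <- str_detect(tokens, "^[[:alnum:]]+$")  # only letters and digits
has_punct  <- str_detect(tokens, "[[:punct:]]")     # at least one punctuation mark
no_alpha   <- str_detect(tokens, "^[^[:alpha:]]+$") # no letters anywhere

only_lower # TRUE FALSE FALSE FALSE FALSE
only_alnum # TRUE TRUE FALSE TRUE FALSE
has_punct  # FALSE FALSE TRUE FALSE TRUE
no_alpha   # FALSE FALSE FALSE TRUE TRUE
```

The caret only negates when it is the first character inside the outer brackets. Anywhere else in the set, as in "[[:punct:]^]", it is a literal caret.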
Let me remind you of the functions in the package stringr covered last time.
Function | Description | Similar Base Functions |
---|---|---|
str_length() | number of characters | nchar() |
str_split() | splits up a string into pieces | strsplit() |
str_c() | string concatenation | paste() |
str_trim() | removes leading and trailing whitespace | trimws() |
str_squish() | removes any redundant whitespace | |
str_detect() | finds a particular pattern of characters | grepl() |
str_view_all() | shows the matching result on the screen | |
All functions in stringr start with "str_" followed by a term related to the task they perform.
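As a quick refresher, here is a minimal sketch of the functions in the table above. The string x and the other inputs are hypothetical examples made up for this sketch.

```r
library(stringr)

# Hypothetical example string with extra spaces
x <- "  text   mining  "

len      <- str_length(x)                      # 17, spaces included
trimmed  <- str_trim(x)                        # "text   mining"
squished <- str_squish(x)                      # "text mining"
joined   <- str_c("text", "mining", sep = "_") # "text_mining"
pieces   <- str_split("a,b,c", ",")[[1]]       # "a" "b" "c"
has_num  <- str_detect(c("bts", "010"), "\\d") # FALSE TRUE
```

str_split() returns a list with one element per input string, which is why the sketch takes the first element with [[1]].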
Most string functions work with regex, a concise language for describing certain patterns of text. The following functions are useful for text pre-processing.
Function | Description |
---|---|
str_which() | Returns the indices of the elements that contain a matching pattern in a string vector |
str_subset() | Returns all elements that contain a matching pattern in a string vector |
str_trunc() | Truncates a string |
str_locate() | Locates the start and end positions of the first matching pattern in a string |
str_locate_all() | Locates the start and end positions of all matching patterns in a string |
str_extract() | Extracts the first matching pattern from a string |
str_extract_all() | Extracts all matching patterns from a string |
str_replace() | Replaces the first matching pattern in a string |
str_replace_all() | Replaces all matching patterns in a string |
str_remove() | Removes the first matching pattern in a string |
str_remove_all() | Removes all matching patterns in a string |
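To see how these functions differ, the sketch below applies several of them to the same input. The vector phones is a hypothetical example made up for this sketch, and the expected values in the comments refer only to that input.

```r
library(stringr)

# Hypothetical sample vector, made up for this sketch
phones <- c("call 010-1234", "no number", "070 and 080")

idx      <- str_which(phones, "\\d+")            # 1 3
hits     <- str_subset(phones, "\\d+")           # "call 010-1234" "070 and 080"
first    <- str_extract(phones, "\\d+")          # "010" NA "070"
all_nums <- str_extract_all(phones, "\\d+")      # list of character vectors
pos      <- str_locate(phones, "\\d+")           # matrix with start and end columns
one      <- str_replace(phones, "\\d+", "#")     # "call #-1234" "no number" "# and 080"
every    <- str_replace_all(phones, "\\d+", "#") # "call #-#" "no number" "# and #"
stripped <- str_remove_all(phones, "\\d")        # "call -" "no number" " and "
short    <- str_trunc("text pre-processing", 10) # "text pr..."
```

The functions without the _all suffix stop at the first match in each string, while the _all variants keep going. That is why str_extract() returns a character vector but str_extract_all() returns a list.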