Regular Expressions in R

Victor Ordu

Beginnings

Check the following collection of words:

begin beige beijing beging bring boing banger

A recurring pattern can be seen in this collection: several of the words contain the character sequence ‘ing’.

In elementary math, such arrangements might lead to the following questions:
- How many words end in ‘ing’?
- Which words end in ‘ing’?
- Which words do not contain ‘ing’?

With regard to these questions, the character sequence i-n-g is known as a regular expression (or regex).
With regular expressions, we can answer these questions programmatically.
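As a minimal sketch, the three questions above could be answered in base R with grepl(). The variable names are illustrative, and the $ anchor (explained under Anchors) is used here to mean "at the end of the string".

```r
# The word collection from the slide above
words <- c("begin", "beige", "beijing", "beging", "bring", "boing", "banger")

# "ing$" means: the sequence "ing" at the very end of the string
ends_in_ing <- grepl("ing$", words)

how_many    <- sum(ends_in_ing)              # how many words end in 'ing'?
which_end   <- words[ends_in_ing]            # which words end in 'ing'?
without_ing <- words[!grepl("ing", words)]   # which words do not contain 'ing'?

print(how_many)
print(which_end)
print(without_ing)
```

grepl() returns one TRUE/FALSE value per word, so counting and subsetting follow directly from it.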

What is a Regular Expression?

A character pattern that is used to match text
Essentially, letters and symbols used to identify strings
Core regex syntax essentially the same across programming languages
Two types of characters used:
- Literals e.g. A to Z, 0 to 9
- Metacharacters e.g. ^, $, +, ?, etc

Literal or “Regular” Characters

Easy way to use regex

a matches ‘apple’, ‘bag’, ‘hat’, ‘dam’

pp also matches ‘apple’

cat matches ‘catch’, ‘locate’, ‘ducat’, and of course, ‘cat’
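The literal matches above can be checked with grepl(). This is only a sketch: the extra words "moon" and "dog" are added here as non-matching examples and are not from the slides.

```r
# Literal patterns match anywhere inside a string
a_words   <- c("apple", "bag", "hat", "dam", "moon")
cat_words <- c("catch", "locate", "ducat", "cat", "dog")

has_a   <- grepl("a", a_words)      # every word except "moon" contains an 'a'
has_pp  <- grepl("pp", a_words)     # only "apple" contains 'pp'
has_cat <- grepl("cat", cat_words)  # every word except "dog" contains 'cat'

print(has_a)
print(has_pp)
print(has_cat)
```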

Metacharacters

Metacharacters are where the real power of regex resides. There are various types:

Character matching
Quantifiers
Anchors
Grouping
Character classes
Escapes

Character matching

. (dot) matches ANY character

\w matches any “word” character (letters, digits and the underscore)

\d matches any digit (0 through 9)

\s matches whitespace (including tabs, newlines, etc.)

The negations of these are \W, \D, \S
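A short sketch of these metacharacters in base R follows. Note that inside an R string each backslash is doubled, so the regex \d is typed as "\\d". The sample strings are made up for illustration.

```r
# . matches any single character between 'b' and 'g'
dot_match <- grepl("b.g", c("bag", "big", "bg"))

# \d matches a digit: replace every digit with '#'
masked <- gsub("\\d", "#", "Room 101")

# \s matches whitespace
has_space <- grepl("\\s", c("no_space", "has space"))

# \W is the negation of \w: remove everything that is not a word character
only_word_chars <- gsub("\\W", "", "a-b_c!")

print(dot_match)
print(masked)
print(has_space)
print(only_word_chars)
```

The last result keeps the underscore, because \w treats it as a word character.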

Quantifiers

Used to indicate how many times a given character is matched
For example, “How many times does the letter ‘q’ occur?”

* means zero or more

? means zero or one time (i.e. the character is optional)

+ means one or more times

{n} means “exactly n times”

{n,} means “n or more times”

{n,m} means “between n and m times”
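The quantifiers can be compared side by side on a small made-up vector. In this sketch each pattern is wrapped in ^ and $ (see Anchors) so that the whole string has to fit the pattern; the test strings are illustrative only.

```r
# 'p' followed by zero, one, two and three 'q's
x <- c("p", "pq", "pqq", "pqqq")

zero_or_more <- grepl("^pq*$", x)      # q may appear any number of times
zero_or_one  <- grepl("^pq?$", x)      # q is optional, at most once
one_or_more  <- grepl("^pq+$", x)      # q must appear at least once
exactly_two  <- grepl("^pq{2}$", x)    # q exactly twice
two_or_more  <- grepl("^pq{2,}$", x)   # q at least twice
one_to_two   <- grepl("^pq{1,2}$", x)  # q once or twice

print(rbind(zero_or_more, zero_or_one, one_or_more,
            exactly_two, two_or_more, one_to_two))
```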

Anchors

These characters are used to define the bounds of strings

^ indicates the start of a line

$ indicates the end of a line

\b indicates the bounds of a “word”
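A brief sketch of the three anchors, reusing the word collection from the opening slide. The second vector is made up for illustration.

```r
words <- c("begin", "beige", "beijing", "beging", "bring", "boing", "banger")

starts_with_be <- grepl("^be", words)   # 'be' at the start of the string
ends_with_ing  <- grepl("ing$", words)  # 'ing' at the end of the string

# \b marks a word boundary, so 'cat' must stand alone as a word
phrases   <- c("cat", "the cat sat", "catch", "ducat")
whole_cat <- grepl("\\bcat\\b", phrases)

print(words[starts_with_be])
print(words[ends_with_ing])
print(phrases[whole_cat])
```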

Grouping

These characters are used for grouping regex characters:

(...) - any set of characters bounded by parentheses is taken as a unit
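Two common uses of grouping can be sketched as follows: applying a quantifier to a whole unit, and reusing a captured group in a replacement via \\1, \\2. The sample strings are illustrative.

```r
# The quantifier + applies to the whole group (ab), not just to 'b'
repeated_ab <- grepl("^(ab)+$", c("ab", "abab", "aba", "abb"))

# Captured groups can be referred to in the replacement as \\1, \\2, ...
swapped <- sub("(\\w+) (\\w+)", "\\2 \\1", "hello world")

print(repeated_ab)
print(swapped)
```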

Character classes

This is a useful construct for matching any one character from a given set

[xyz] is taken as match on “any of x, y, or z”

This would match “zoo”, “xenon”, “eyes”

Character classes also allow us to use ranges:

[1-3] will match “156” and “562” but will not match “094”

To match all the English lower case letters we can use [a-z]

We also have named character classes. In R, these are written inside a bracket expression, e.g. [[:alnum:]]:

[:alnum:] - any alphanumeric character (same as [A-Za-z0-9])

[:alpha:] - any English letter (same as [A-Za-z])

[:upper:] - any upper case character (same as [A-Z])

[:lower:] - any lower case character (same as [a-z])

[:digit:] - 0 through 9 (same as [0-9])

… and more
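The character classes above might be tried out as in this sketch. In R, a named class only works inside a bracket expression, hence the double brackets in [[:upper:]] and [[:alnum:]]. The strings other than those quoted on the slides are made up.

```r
# Any of x, y or z
any_xyz <- grepl("[xyz]", c("zoo", "xenon", "eyes", "apple"))

# A range: any of the digits 1, 2 or 3
has_1_to_3 <- grepl("[1-3]", c("156", "562", "094"))

# Named class: does the string start with an upper case letter?
starts_upper <- grepl("^[[:upper:]]", c("Abuja", "lagos"))

# A leading ^ inside the brackets negates the class:
# remove everything that is not alphanumeric
alnum_only <- gsub("[^[:alnum:]]", "", "R-4.3!")

print(any_xyz)
print(has_1_to_3)
print(starts_upper)
print(alnum_only)
```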

Escapes

In regex, we can escape characters with the backslash (\).
Escaping is particularly important when you want to match a character that also serves as a regex metacharacter.

For instance:
- For the string “M.Sc.” the regex M.Sc. might not yield the desired result.
- This is because . will match any character.
- A better regex would be M\.Sc\.
- The backslashes ensure that we match the periods (.) exactly.
- To match a literal backslash, double it, i.e. \\
- In R, the backslash must itself be escaped inside a string, so the regex M\.Sc\. is typed as "M\\.Sc\\."
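The difference escaping makes can be seen in a small sketch. The string "MaScx" is a made-up counterexample, and fixed = TRUE is shown as an alternative that treats the whole pattern as plain text.

```r
degrees <- c("M.Sc.", "MaScx")

# Unescaped: each . matches any character, so both strings match
unescaped <- grepl("M.Sc.", degrees)

# Escaped: the regex M\.Sc\. is typed as "M\\.Sc\\." in an R string
escaped <- grepl("M\\.Sc\\.", degrees)

# Alternative: switch regex off and match the pattern literally
literal <- grepl("M.Sc.", degrees, fixed = TRUE)

# Matching a backslash: the regex \\ is typed as "\\\\" in R
win_path      <- "C:\\Users"   # the actual text is C:\Users
has_backslash <- grepl("\\\\", win_path)

print(unescaped)
print(escaped)
print(literal)
print(has_backslash)
```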

Lookaround (Advanced)

This type of regex examines the characters next to a potential match to decide whether it counts as a match, without making those characters part of the match.

Positive Lookahead
X(?=Y) - is a match if X is followed by Y (where X and Y can each be one or more characters)

Negative Lookahead
X(?!Y) - is a match if X is not followed by Y

Positive Lookbehind
(?<=Y)X - is a match if X is preceded by Y

Negative Lookbehind
(?<!Y)X - is a match only if X is not preceded by Y
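A sketch of the four lookarounds in base R follows. Base R's default regex engine does not support lookaround, so perl = TRUE is passed to each call. The sample strings and variable names are made up for illustration.

```r
prices <- "price: 100USD, 200EUR, 300USD"
note   <- "cost $45 and 30 items"

# Positive lookahead: digits that are followed by 'USD'
usd <- regmatches(prices, gregexpr("\\d+(?=USD)", prices, perl = TRUE))[[1]]

# Negative lookahead: a 'q' that is not followed by 'u'
q_without_u <- grepl("q(?!u)", c("queen", "Iraq", "qatar"), perl = TRUE)

# Positive lookbehind: digits that are preceded by a dollar sign
after_dollar <- regmatches(note, gregexpr("(?<=\\$)\\d+", note, perl = TRUE))[[1]]

# Negative lookbehind: a whole number that is not preceded by a dollar sign
not_after_dollar <- regmatches(
  note, gregexpr("(?<!\\$)\\b\\d+", note, perl = TRUE)
)[[1]]

print(usd)
print(q_without_u)
print(after_dollar)
print(not_after_dollar)
```

In each case the neighbouring text (USD, the dollar sign) is checked but is not returned as part of the match.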

General Recap

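^abc$ - exactly “abc”
[0-9]{4} - contains four consecutive digits
\bword\b - the whole word “word”
Both \w{5} and [[:alnum:]]{5}
- will match “black”
- will also match “blue3”, since digits count as alphanumeric characters
- will not match “blue”, which has only four characters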
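The recap patterns can be verified with a quick sketch. The test strings not quoted on the slide are made up. Note that a digit counts as an alphanumeric character, so a five-character string such as "blue3" does satisfy \w{5}, while the four-character "blue" does not.

```r
exact_abc   <- grepl("^abc$", c("abc", "abcd"))
four_digits <- grepl("[0-9]{4}", c("year 2024", "123"))
whole_word  <- grepl("\\bword\\b", c("a word here", "swordfish"))

colours      <- c("black", "blue3", "blue")
five_word    <- grepl("\\w{5}", colours)
five_alnum   <- grepl("[[:alnum:]]{5}", colours)

print(exact_abc)
print(four_digits)
print(whole_word)
print(five_word)
print(five_alnum)
```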

Quick Syntax Cheat Sheet

`.`	any character
`*`	0 or more
`+`	1 or more
`?`	0 or 1
`[]`	character class
`()`	capture group
`{n}`	exactly n times
`{n,}`	n or more times
`{n,m}`	between n and m times
`|`	OR
`\b`	word boundary
`\d`	digit (shortcut)
`\s`	whitespace (space, tab)
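Two cheat-sheet entries that have no example elsewhere in these slides, the OR operator and the "0 or 1" quantifier, can be sketched like this (sample words are illustrative):

```r
# | means OR: match either alternative
cat_or_dog <- grepl("cat|dog", c("cat", "dog", "bird"))

# ? makes the preceding character optional (0 or 1 times)
either_spelling <- grepl("colou?r", c("color", "colour", "colr"))

print(cat_or_dog)
print(either_spelling)
```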

Why Regular Expressions?

Extract patterns from (messy) text
Clean and validate input
Useful for “Find and Replace” actions e.g. in IDEs
Example use cases:
- Extract phone numbers
- Clean up inconsistent names
- Validate email formats
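The three use cases might look like this in base R. This is a rough sketch under simplifying assumptions: the phone number format (4-3-4 digits separated by hyphens), the sample names and the email pattern are all invented for illustration, and the email pattern is far looser than a real validator.

```r
# 1. Extract phone numbers (assumed format: 0000-000-0000)
msg    <- "Call 0803-555-0199 or 0805-555-0123 today"
phones <- regmatches(msg, gregexpr("[0-9]{4}-[0-9]{3}-[0-9]{4}", msg))[[1]]

# 2. Clean up inconsistent names: trim the ends, collapse runs of
#    whitespace to a single space, then lower-case everything
raw_names   <- c("  ada   lovelace ", "ADA LOVELACE")
clean_names <- tolower(gsub("\\s+", " ", trimws(raw_names)))

# 3. Validate email formats (deliberately simple pattern)
email_pattern <- "^[[:alnum:]._-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}$"
emails        <- c("user@example.com", "not-an-email", "a@b")
valid_email   <- grepl(email_pattern, emails)

print(phones)
print(clean_names)
print(valid_email)
```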