---
title: "Regular Expressions"
author: "Michael O'Connor"
date: "`r format(Sys.Date(), '%B %d, %Y')`"
output: 
  rmdformats::readthedown:
    code_download: yes
    df_print: paged
    highlight: haddock
    thumbnails: false
    lightbox: false
  pdf_document:
    latex_engine: xelatex
geometry: margin=1in
urlcolor: 'blue'

---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.height = 4, fig.width = 8)
library(stringr)
library(stringi)
library(knitr)
library(kableExtra)
library(plyr)
library(dplyr)
library(readxl)
```

Think back to the last time you opened your email inbox to look for a specific email that you received last year. You know that the sender's first name was 'Jason' and that the email revolved around the topic of SCADA data, but not much more than that. What's more, there are 20,000 emails in your inbox, so you can't go through them one by one. What do you do? Most likely you would look for a search bar in your email application and search for something like 'From:Jason received:last year contents:SCADA'. With any luck, this narrows the number of emails you need to look through to something far more manageable. This is an exceptionally powerful and useful feature, and it is worth paying attention to, because many situations don't come with such a convenient pre-built search tool. Suppose, as a real example, that you have a CSV file with 20,000 rows that contains free-form text feedback in one of its columns, and you need to assign each text response to one of five possible categories. What do you do?

1. Fully read through each (sometimes multi-sentence) response and assign each to the correct category
2. Skim through each response (i.e. first phrase or first sentence) and assign each to a given category 
3. Something else

When dealing with situations like this, it is important to acknowledge that reading through each and every response is not a reasonable solution. Not only would it take an excruciatingly long time, it would also be fraught with the mistakes you inevitably make as you grow tired, annoyed, and/or bored. Skimming is also likely to produce many mistakes (not to mention that it would still take a very long time). Instead, it is far more efficient to do something else; specifically, to filter the list of responses according to criteria that associate certain bodies of text with a given category (just like you filtered for a given email in the previous example). This kind of filtering/manipulation can be accomplished using **Regular Expressions**.


So, what are **Regular Expressions**? Regular Expressions (often referred to as regex, or rational expressions) are sequences of characters that define a search pattern that string-searching algorithms can understand. What's more, many applications that implement some form of search make use of at least one flavor of Regular Expressions. They are also highly versatile, and can be used to filter, manipulate, and even extract specific pieces of information from larger bodies of text. Over the course of this document, we shall introduce the basic concepts behind Regular Expressions, and work through a number of test cases showcasing how they work and how they can be applied in different scenarios.

## Basic Concepts

Regular Expressions can be used to specify/identify a certain set or pattern of characters for the purpose of filtering, manipulation, or extraction. This can be thought of as a set of querying rules that can be used on a body of text, similar to how one can query a more structured/standard dataset using SQL, inequalities, and the like (e.g. filter out all records whose x-variable is less than 4). The basic building blocks for these string query-patterns are laid out in the following table. It should be noted that what follows is not an exhaustive list; for more comprehensive details, see [Regular-Expressions.info](https://www.regular-expressions.info/), though I should note that the full content spans more than 300 pages when collapsed to PDF (just a warning - it also covers the many different 'flavors' of Regular Expression syntax, a topic that I will not cover here, short of saying you can think of them as different dialects).


```{r, echo=FALSE, results='asis'}
rules <- data.frame(
  Operation = c("Boolean 'OR'", "Grouping (Op Precedence)", "Bracketed Inclusion", "Wildcard", "Possible Existence", "Arbitrary Repetition", "Mandatory Repetition", "Repetition Range", "Backreference"),
  `Pattern` = c("|", "(...)", "[...]", ".", "?", "*", "+", "{n,m}", "\\\\n"),
  Meaning = c("Standard OR operation separating possible string patterns.", "Used to mark/group together and define the scope of certain operations (i.e. as in algebra)", "Single character match contained within the brackets. Allows the use of '-' to specify ranges, like [0-9] or [a-z].", "The '.' represents a wildcard that matches any single character", "The '?' indicates that the preceding element occurs zero or one time in the search pattern.", "The '*' indicates that the preceding element occurs zero or more times in the search pattern.", "The '+' indicates that the preceding element occurs one or more times in the search pattern.", "The {n,m} indicates that the preceding element occurs at least n times and at most m times in the search pattern.", "The '\\\\n' matches the same text that the n-th parenthesized group matched earlier in the pattern (n between 1 and 9)"),
  Example = c("'cat|dog'", "'mea(n|t)s'", "'T[0-4]'", "'Test .A'", "'colou?rs?'", "'ab*c'", "'4+'", "'={4,6}'", "([a-z]+) \\\\1"),
  `Returns/Matches` = c("'cat', 'dog'", "'means', 'meats'", "'T0', 'T1', 'T2', 'T3', 'T4'", "'Test AA', 'Test 1A', 'Test aA', ...", "'color', 'colour', 'colors', 'colours'", "'ac', 'abc', 'abbc', 'abbbc', ...", "'4', '44', '444', '4444', ...", "'====', '=====', '======'", "Lowercase text immediately repeated after a space, e.g. 'the the'"),
  check.names = FALSE
)
kable(rules, format = "markdown", align = "lcllr", escape = TRUE)
```
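
As a quick check of a few rows in the table, the sketch below tests some of the example patterns against hand-picked strings using base R's `grepl`, which returns one TRUE/FALSE per input string. The test strings are invented for illustration. Two things here are not in the table: the `^` and `$` anchors (start and end of string), used so that the whole string must match, and the `\b` word boundary, which is introduced later in this document. The `perl = TRUE` argument on the last call is a cautious choice of regex engine, not a requirement stated anywhere above.

```{r}
# Possible Existence: zero or one 'u', zero or one 's'
grepl("^colou?rs?$", c("color", "colours", "colouur"))    # TRUE TRUE FALSE

# Arbitrary Repetition: zero or more 'b' between 'a' and 'c'
grepl("^ab*c$", c("ac", "abbbc", "abd"))                  # TRUE TRUE FALSE

# Repetition Range: between four and six '=' signs
grepl("^={4,6}$", c("===", "=====", "======="))           # FALSE TRUE FALSE

# Backreference: the same whole word twice in a row
grepl("\\b([a-z]+) \\1\\b",
      c("this is is a typo", "this is fine"), perl = TRUE) # TRUE FALSE
```

Without the leading word boundary, the last pattern would also flag "this is fine", because the "is" at the end of "this" is followed by " is".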




## String Pattern Filtering




```{r}
strOptions <- c("1 cat", "2 dogs", "3 cats", "four horses")
# NOTE: The grep function is basically the 'string-search' function in R.
# It is called in the following way: grep(SearchPattern, Text) -> Matches
# With value = TRUE it returns the matching strings rather than their positions.
# We'll use it to keep only the options that contain the string sequence 'cat'.
grep("cat", strOptions, value = TRUE)
```

The standard approach of filtering on a string literal is arguably the simplest to understand; it searches through each string and returns those that contain the provided sub-sequence. While this is useful, there are many cases in which it doesn't help much, because there is no single self-contained string to search for. For example, what if we were searching for any integer in a body of text? A literal search cannot express this. This is where the special characters and patterns of Regular Expressions (see table above) really shine, since they don't require knowledge of the exact string in question, but only the pattern that the sequence of characters being looked for abides by. A few simple examples demonstrating this are shown below.

```{r}
# Can use square brackets searching for inclusion among a set of values
grep("[0-9]", strOptions, value = TRUE)
# Can again use square brackets searching for inclusion among a smaller set of values
grep("[12]", strOptions, value = TRUE)
# Alternatively, we can use the OR operator for larger strings
grep("cat|horse", strOptions, value = TRUE)
```



Now, let's consider a slightly more interesting case, where the text strings are longer and the filter we're trying to use is more complex.




```{r, echo=1}
strOptions <- sentences[1:5]
strOptions
```

Suppose that we want to keep only the sentences that contain at least one character that is sequentially duplicated in the sentence (e.g. 'll' or 'oo'). The value of the duplicated character does not matter in this case, preventing us from searching for an exact match. Thus, we need to make use of the wildcard (because the exact character does not matter), wrapped in parentheses so that it is captured as a group, and a backreference (because we want that specific captured character to occur a second time).

```{r}
# Filter for sentences that contain a sequentially duplicated character.
grep("(.)\\1", strOptions, value = TRUE)
```
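
Returning to the feedback-categorization problem from the introduction, the same filtering idea can be used to assign categories. The sketch below is illustrative only: the `feedback` responses, the category names, and the keyword patterns are all invented, and real data would need patterns tuned to the actual responses.

```{r}
# Hypothetical free-form responses (invented for illustration)
feedback <- c("The invoice was wrong again",
              "App crashes when I log in",
              "Great support team, thanks!",
              "Billing charged me twice",
              "No comment")

# Hypothetical category patterns; the first matching pattern wins
patterns <- c(Billing   = "invoice|billing|charged",
              Technical = "crash(es|ed)?|log ?in|error",
              Praise    = "great|thanks|love")

# One column per category, one row per response (TRUE where the pattern matches)
hits <- sapply(patterns, grepl, x = feedback, ignore.case = TRUE)

# Take the first category whose pattern matched, or "Other" if none did
category <- apply(hits, 1, function(row) {
  if (any(row)) names(patterns)[which(row)[1]] else "Other"
})
data.frame(feedback, category)
```

Responses that land in "Other" are the ones worth reading by hand, which is usually a far smaller pile than the full 20,000 rows.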




## String Pattern Manipulation


Beyond basic filtering, we can also use Regular Expressions to manipulate the contents of a collection of text strings. This is analogous to the 'Find and Replace' functionality that you have likely used in text editors (Word, Outlook, Notepad, etc.). However, by using the search/filter capabilities of Regular Expressions, modifications can be made not only on the basis of exact string sequences, but also on the basis of string patterns!

```{r}
numStrs <- c("1234568", "12", "1234", "98765432")
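# NOTE: The gsub function is the 'string-search-and-replace' function in R.
# It works like so: gsub(SearchPattern, ReplacementPattern, Text) -> ModifiedText
# Format string representations of large integers: reverse each string, add a
# comma after every group of 3 digits, then reverse back.
# The lookahead (?=[0-9]) only adds a comma when another digit follows, which
# avoids a leading comma for lengths that are a multiple of 3 (e.g. "123456").
# Lookaheads require perl = TRUE.
stri_reverse(gsub("([0-9]{3})(?=[0-9])", "\\1,", stri_reverse(numStrs), perl = TRUE))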

# Remove nonsense string found within two tags (e.g. like in HTML or XML data files)
smpErrTag <- "<tag id='ID123' style='margin-right:0px' class='entry'>lkkjfasddslkj</tag>"
gsub("(<tag id='ID123' .*>).*</tag>", "\\1Corrected Tag Input</tag>", smpErrTag)
```
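
Replacement patterns can refer to more than one captured group. As a small sketch (the date strings are made up for illustration), the following captures the year, month, and day of ISO-style dates and writes them back in a different order.

```{r}
# Hypothetical ISO-style dates (year-month-day)
isoDates <- c("2018-07-02", "2020-12-31")

# Groups 1-3 capture year, month, and day; the replacement reorders them
gsub("([0-9]{4})-([0-9]{2})-([0-9]{2})", "\\2/\\3/\\1", isoDates)
# Expected: "07/02/2018" "12/31/2020"
```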

Two additional special characters that prove particularly useful are the ^ (caret) and \\b symbols. When placed as the first character inside square brackets, the ^ is interpreted as NOT (e.g. `[^0-9]` matches any character that is not a digit), while \\b marks what is known as a 'word boundary': the zero-width position between a word character and a non-word character such as a space or punctuation mark. Word boundaries are useful whenever a pattern should match whole words only, rather than fragments inside longer words (something that arises all the time). An example of word boundaries in action is shown below.



```{r}
# Remove all words with 3 or fewer characters
gsub("\\b[a-zA-Z]{1,3}\\b ?", "", sentences[1:2])
```

Notice that this removed the short prepositions and definite/indefinite articles from the two sentences. If we were interested in identifying the topic of a body of text, or in whether the general contents of one body of text are similar to another, then removing words that carry little meaning for the comparison (like articles and prepositions) is often an effective way to improve the accuracy of a Machine Learning Model. Thus, there are cases where such an odd data preparation method (i.e. removing all words with 3 or fewer characters) could actually be warranted. That is not to say that this blunt approach is the only way to do this, or even the best (it plainly isn't, since it would also remove short meaningful words such as 'cat'); it is, however, among the quickest and simplest.
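
For completeness, the caret mentioned before the word-boundary example can be sketched in the same way. The phone-number strings below are invented for illustration. Note that outside square brackets the caret has a different meaning: it anchors the pattern to the start of the string.

```{r}
# Hypothetical messy phone numbers
phones <- c("(555) 123-4567", "555.123.4567 ext 9")

# Inside brackets, a leading ^ negates the set: remove everything that is NOT a digit
gsub("[^0-9]", "", phones)
# Expected: "5551234567" "55512345679"

# Outside brackets, ^ anchors to the start: does the string begin with a digit?
grepl("^[0-9]", c("1 cat", "four horses"))
# Expected: TRUE FALSE
```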







## String Pattern Extraction

Arguably the most powerful aspect of Regular Expressions is how effectively they can be used to extract key pieces of information from a larger body of text, or from a large collection of texts. Both applications are relevant throughout Data Science, from Feature Engineering and Machine Learning to simple Data Curation.


As a simple example, consider a series of sentences that contain numerical values you need for some analysis. It might be reasonable to simply read through those sentences if there aren't many of them (say 10), but what do you do if there are 100,000? Well, with Regular Expressions, that isn't a problem!


```{r, echo=FALSE}
# Build the example sentences: keep the stringr sentences that spell out a
# number between one and ten, then replace each spelled-out number with digits.
numbers <- strsplit("one|two|three|four|five|six|seven|eight|nine|ten", split = "|", fixed = TRUE)[[1]]
snts <- grep(paste0("\\b", paste0(numbers, collapse = "\\b|\\b"), "\\b"), sentences, value = TRUE)
snts1 <- unlist(Filter(length, lapply(snts, function(a) {
  for (i in seq_along(numbers)) {
    if (grepl(numbers[i], a)) {
      # Drop sentences where the number is not followed by a usable word
      tphrase <- str_extract_all(a, sprintf("%s [a-z]+", numbers[i]))[[1]]
      if (length(tphrase) == 0) return(c())
      if (str_length(strsplit(tphrase, " ")[[1]][2]) < 3) return(c())
      if (strsplit(tphrase, " ")[[1]][2] %in% c("distinct", "met", "comes", "than", "when", "kinds", "batches", "costs")) return(c())
    }
    # Replace the spelled-out number with its numeric value
    a <- gsub(paste0("\\b", numbers[i], "\\b"), i, a)
  }
  a
})))
exmpl <- snts1[1:10]
```

```{r}
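# Extract the numerical values and the objects they refer to from a number of sentences.
# "[0-9]+ [a-z]+" matches one or more digits, a space, then a lowercase word.
tibble(Sentence = exmpl, Extract = str_extract(exmpl, "[0-9]+ [a-z]+")) %>%
  # Number: everything before the space; Content: everything after it
  mutate(Number = as.numeric(gsub(" .*", "", Extract)), Content = gsub(".* ", "", Extract))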
```

Using Regular Expressions, we were immediately able to extract just the relevant information from a larger body of text. Having done so, we can convert the extracted numbers to numeric values and run any statistical analysis or calculation we like, something that was not possible while the numbers were buried in phrases within the larger body of text.
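
As an alternative sketch of the same extraction, capture groups can pull out the number and the word in a single step using `str_match` from stringr. The `demo` sentences are invented stand-ins, not the sentences used above.

```{r}
library(stringr)

# Hypothetical sentences (invented for illustration)
demo <- c("She bought 3 apples today.", "No numbers here.", "It took 12 hours.")

# str_match() returns a character matrix: column 1 is the full match,
# followed by one column per capture group (NA where nothing matched)
parts <- str_match(demo, "([0-9]+) ([a-z]+)")

data.frame(Sentence = demo,
           Number   = as.numeric(parts[, 2]),
           Content  = parts[, 3])
```

Sentences without a match simply produce `NA` in both columns, which makes it easy to see which rows the pattern missed.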



## Conclusion

In conclusion, Regular Expressions are a powerful tool when dealing with large bodies of text, whether you are an analyst trying to clean up data being fed into a Machine Learning Model, a student looking for a particular passage/example in an electronic textbook/PDF/EBook, or just a computer-savvy individual searching for certain text patterns in a set of emails or a particularly large Microsoft Word document. Arguably their greatest strengths over alternative techniques are their simple construction (little coding experience required) and the fact that the search tools in many popular applications can interpret Regular Expressions or a close relative of them (though the 'flavors' that each app uses may vary). So the next time you pull up Ctrl-F on a Word Document, consider whether a Regular Expression could make your next 'find' or 'find-and-replace' all the more precise!



```{r, results='hold', echo=FALSE}
options(width = 80)
cat("R Session Information:\n")
sessionInfo()
```



