*Work in progress

Regular expressions are a powerful tool for finding and matching specific patterns in a string (or a character vector).

Any use of regular expressions involves at least two key parameters: first, the pattern parameter, which defines what type and/or sequence of characters to look for; and second, the text parameter, which identifies the string content against which to match this pattern. The pattern itself is always provided as a string, so it must be enclosed in (single or double) quotation marks.

swirl::install_course("Regular Expressions")
library(swirl)
library(stringr)
library(dplyr)  # provides mutate() and the %>% pipe
library(tibble) # provides enframe()

swirl() # Course about regular expressions

# Example datasets that come with stringr
sentences
fruit
words

sentences %>% # quick test on the sentences dataset
  enframe() %>%
  mutate(new = str_replace(string = value, pattern = "^The", replacement = "other")) %>%
  head()

Regex in base R

There are seven functions: grep, grepl, regexpr, gregexpr, regexec, sub and gsub.

These all use overlapping and/or similar input parameters, and differ mainly in terms of functionality and the type of output they provide.

grep, grepl, regexpr, gregexpr and regexec search for matches to the pattern argument within each element of a character vector, while sub and gsub use regular expressions to perform ‘find and replace’.

sent <- stringr::sentences %>% 
  head(100)

# grep returns a vector of indices indicating which elements of the 'text' vector match the provided 'pattern'.
grep("days", sent)

# Using grep with value = TRUE, returns the matching elements of the 'text' vector.
grep("days", sent, value = TRUE)

# grepl is very similar to grep, but instead returns logical output i.e. it returns TRUE or FALSE for each element of the 'text' vector, indicating whether or not the element matches the provided pattern. 

grepl("days", sent)

# regexpr returns an integer vector of the same length as the 'text' vector, with each number indicating the starting position of the first pattern match within the corresponding element (or -1 if there is no match).

regexpr("the", sent)

# gregexpr is similar to regexpr, except that instead of returning an integer vector it returns a list, with each list element indicating the starting positions of all pattern matches within the respective element of the text (or -1 if there is no match).

gregexpr("the", sent)

# regexec is similar to gregexpr in that it also returns a list of the same length as the 'text' parameter, with each element either -1 if there is no match, or a sequence of integers with the starting positions of the match and of the substrings corresponding to parenthesized subexpressions of the 'pattern' parameter. This means that with pattern = "awe(some)" and text = "awesome", you would get back a list containing c(1,4) along with the match.length attribute c(7,4), reflecting that "awesome" is matched at position 1 and "some" is matched at position 4.
regexec("day(s)", sent)

# sub can be used to replace the first match of a regular expression in each element of a character vector. As before, you'll need to provide a 'pattern' argument in the form of a regular expression and a 'text' vector x. In addition, you'll need to provide a 'replacement' string.

sub("days", "night", sent)

# gsub is just like sub, except that it replaces all matches (not just the first) in each element of a character vector.
gsub("days", "night", sent)
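To make the differences in output concrete, here is a minimal sketch on a small made-up vector (`x` is an illustrative assumption, not part of the stringr data). The values in the comments follow from where "days" sits in each element.

```r
# Hypothetical three-element vector used only for illustration
x <- c("awesome days", "no match here", "days and days")

idx   <- grep("days", x)                       # 1 3
vals  <- grep("days", x, value = TRUE)         # "awesome days" "days and days"
hits  <- grepl("days", x)                      # TRUE FALSE TRUE
first <- as.integer(regexpr("days", x))        # 9 -1 1
all3  <- as.integer(gregexpr("days", x)[[3]])  # 1 10

# The "awe(some)" example described above
m      <- regexec("awe(some)", "awesome")[[1]]
starts <- as.integer(m)                        # 1 4
lens   <- as.integer(attr(m, "match.length"))  # 7 4

once  <- sub("days", "night", x)   # only the first match per element
every <- gsub("days", "night", x)  # every match per element
```

Only the third element differs between `once` ("night and days") and `every` ("night and night"), which is the whole difference between sub and gsub.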

Anchors

Anchors allow you to specify where within the string a particular pattern should be matched.

Examples: “^” matches at the start of a string, and “$” matches at the end of a string.

“\b” matches a word boundary, i.e. the empty position at either edge of a word. Note that R treats backslashes as escape characters in character constants (in addition to regular expressions, which also do so), so when supplying “\b” you need to escape it with another backslash, i.e. “\\b”.

“\B” matches a non-word boundary; it is precisely the opposite of “\b”. In R code it is written as “\\B”.

All anchors go inside the quoted pattern.

sentences %>% 
  enframe() %>% 
  mutate(new = str_replace(string = value, pattern = "^The", "other"))%>%
  head()

sentences %>% 
  enframe() %>% 
  mutate(new = str_replace(string = value, pattern = "punch\\.$", replacement = "other")) %>% # the sentence ends with a period, so the pattern must include it (escaped) before "$"
  head()
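The four anchors can be compared side by side on a tiny vector. This is a minimal sketch using base R's grepl; the vector `x` is a made-up example.

```r
# Hypothetical vector: "cat" as a word, inside a word, and alone
x <- c("the cat", "concatenate", "cat")

starts_with <- grepl("^cat", x)       # FALSE FALSE TRUE
ends_with   <- grepl("cat$", x)       # TRUE FALSE TRUE
whole_word  <- grepl("\\bcat\\b", x)  # TRUE FALSE TRUE
inside_word <- grepl("\\Bcat\\B", x)  # FALSE TRUE FALSE
```

Note that every backslash is doubled in the R source, as described above.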

Character Classes

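Character classes can be written in two ways: with “[:” and “:]” around a predefined name inside square brackets, e.g. “[[:digit:]]”; or with a backslash “\” followed by a special character, e.g. “\d” for a digit (which is, in R code, “\\d”). A non-digit character is matched with “\D” (in R code, “\\D”).

“[[:alpha:]]” is equivalent to “[[:lower:][:upper:]]” or “[A-Za-z]”. Note the brackets inside brackets.

Classes of the form “[:…:]” are only valid inside square brackets, so to use or combine (union) them within a pattern they need to be enclosed in square brackets, e.g. “[[:lower:][:upper:]]” as above.

Patterns like “[a-z]”, however, are combined by listing the ranges inside a single pair of square brackets, e.g. “[A-Z0-9]”. The nested form “[[A-Z][0-9]]” is treated as a union by some engines (for example ICU, which stringr uses), but not by base R's grepl(), so the flat form is the safer choice.

By contrast, “[A-Z][0-9]” is not equivalent to “[A-Z0-9]”, since it is a two-character pattern: an upper-case letter followed by a digit.

“[[:alnum:]]” matches any alphanumeric character, in other words alphabetic characters plus digit characters. It is equivalent to “[[:alpha:][:digit:]]” or “[A-Za-z0-9]”.

“[A-z]” is not a safe substitute for “[A-Za-z]”: in ASCII order the range from “A” to “z” also includes the characters “[”, “\”, “]”, “^”, “_” and “`”.

Dot “.” matches any character except line breaks.

In many cases the two notations are interchangeable, e.g. “[[:digit:]]” and “\\d”.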

# contain a digit (note the double brackets around the class name)
grepl("[[:digit:]]", sentences) %>% head()

# start with a non-digit character
grepl("^\\D", sentences) %>% head()

# begin with a lower-case character
grepl("^[[:lower:]]", sentences) %>% head()

# end with an upper-case character
grepl("[[:upper:]]$", sentences) %>% head()
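Because the stringr sentences contain almost no digits, the effect of each class is easier to see on a small made-up vector. A minimal sketch (the vector `x` is an assumption for illustration; the last line uses `perl = TRUE` so that the range is evaluated by character code):

```r
# Hypothetical inputs mixing letters, digits and an underscore
x <- c("abc", "ABC", "a1", "123", "a_b")

has_digit_posix <- grepl("[[:digit:]]", x)      # FALSE FALSE TRUE TRUE FALSE
has_digit_short <- grepl("\\d", x)              # same result as above
only_letters    <- grepl("^[[:alpha:]]+$", x)   # TRUE TRUE FALSE FALSE FALSE
only_alnum      <- grepl("^[A-Za-z0-9]+$", x)   # TRUE TRUE TRUE TRUE FALSE

# "[A-z]" also covers the punctuation between "Z" and "a", such as "_"
underscore_in_A_to_z <- grepl("[A-z]", "_", perl = TRUE)  # TRUE
```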

Ranges [ ] and Groups ( )

The key metacharacters here are square and round brackets: “[”, “]”, “(” and “)”.

Square brackets are used for ranges within a regular expression, and round brackets (or parentheses) are used to create groups within a regular expression.

Square brackets are used to define a set or range of characters, where one (or more) must usually be matched in the text. For example, the pattern “[abc]” will match any text which contains either an “a” or a “b” or “c” (as opposed to the pattern “abc” which would only match text containing all three letters “abc” adjacent to each other and in that order at some point in the string).

Ranges can also be specified using a “-”. The pattern “[a-c1-3]” is equivalent to “[abc123]”.

To match the “y” and “z” characters in “xxxxyyyyyzzzzz”, use “[yz]”.

A range can be negated by including a “^” at the beginning: “[^abc]” will match any character other than “a”, “b” or “c”.

“item_[0-1][1-9][a-z]” matches strings like “item_01a”, “item_11b” and “item_19c”. It does not match “item_10b”, because the second digit must be between 1 and 9.
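The range rules above can be checked directly. A minimal sketch in base R; the item names are made-up test inputs:

```r
x <- "xxxxyyyyyzzzzz"
without_yz <- gsub("[yz]", "", x)    # removes every y and z: "xxxx"
only_yz    <- gsub("[^yz]", "", x)   # negated range keeps only y and z: "yyyyyzzzzz"

# "[a-c1-3]" behaves like "[abc123]"
in_range <- grepl("[a-c1-3]", c("b", "2", "d", "7"))  # TRUE TRUE FALSE FALSE

# The second digit must be 1-9, so "item_10b" is rejected
items      <- c("item_01a", "item_10b", "item_19c")
item_match <- grepl("item_[0-1][1-9][a-z]", items)    # TRUE FALSE TRUE
```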

Groups “( )”: the pattern “item_([0-1][1-9])([a-z])” matches exactly the same strings, except that it also captures the item number (e.g. ‘01’) and the item letter (e.g. ‘b’).

These groups can then be referenced within the match and/or have other regex operators applied to them. In R, the captured groups can be referenced with ‘\1’, ‘\2’ up to ‘\9’ (written “\\1”, “\\2”, … in R strings).

sub(pattern = "item_([0-1][1-9])([a-z])", replacement = "item_num_\\1_sec_\\2", x = "item_02c")
## [1] "item_num_02_sec_c"
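Besides using the groups in a replacement, the captured pieces can be pulled out as values. A minimal sketch combining regexec with regmatches, both base R functions:

```r
item <- "item_02c"

# regexec records the position of the full match and of each group;
# regmatches turns those positions into the matched substrings
m      <- regexec("item_([0-1][1-9])([a-z])", item)
pieces <- regmatches(item, m)[[1]]

full_match  <- pieces[1]  # "item_02c"
item_number <- pieces[2]  # "02"
item_letter <- pieces[3]  # "c"
```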

Quantifiers

Quantifiers indicate characters or groups of characters that may be repeated in the search pattern: a quantifier matches a certain quantity of the character or sub-expression immediately to its left. It always goes inside the quotes, as in “a+”, and outside the parentheses to repeat a group, as in “(abc)+abx”.

Plus “+” matches 1 or more occurrences. In a pattern such as “abc+abx”, the quantifier only applies to the character immediately to its left (i.e. the “c”), so this pattern matches the whole string “abcccccabx” but not the whole string “abcabcabx” (an unanchored search would still find “abcabx” inside the latter). By contrast, in a pattern like “(abc)+abx”, the quantifier applies to the sub-expression “(abc)”, so this pattern matches “abcabcabx” but not “abcccccabx”.

Question mark: the “?” quantifier indicates a match of either 0 or 1 occurrences.

The pattern “Could be 10? or 20?” would match: “Could be 1 or 2”; “Could be 10 or 20”.

Star: “*” is used to match 0 or more occurrences of a character or sub-expression. The pattern “xy*z” will match “xz”, “xyz”, “xyyz”, “xyyyz” and “xyyyyz”.

Braces “{}”: the form “(sub){x,y}” indicates between x and y occurrences of the sub-expression (sub).

For example, “a{2,4}” will match any string containing “aa”, “aaa” or “aaaa”.

“a{2,}” will match strings containing “aa”, “aaa”, “aaaa”, “aaaaa” etc.

“(sub){x}” will match exactly x occurrences.
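The quantifier examples above can be verified with grepl. In this minimal sketch the patterns are wrapped in “^” and “$” so that the whole string has to match; without the anchors, grepl reports a match as soon as any substring fits. The test strings are illustrative assumptions.

```r
s <- c("abcccccabx", "abcabcabx")
plus_on_char  <- grepl("^abc+abx$", s)    # TRUE FALSE
plus_on_group <- grepl("^(abc)+abx$", s)  # FALSE TRUE

# Unanchored, "abc+abx" also finds "abcabx" inside "abcabcabx"
plus_unanchored <- grepl("abc+abx", s)    # TRUE TRUE

optional <- grepl("^Could be 10? or 20?$",
                  c("Could be 1 or 2", "Could be 10 or 20", "Could be 100 or 200"))
# TRUE TRUE FALSE

star   <- grepl("^xy*z$", c("xz", "xyz", "xyyyyz", "xyxz"))    # TRUE TRUE TRUE FALSE
braces <- grepl("^a{2,4}$", c("a", "aa", "aaaa", "aaaaa"))     # FALSE TRUE TRUE FALSE
```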

Everything combined

“^\d+$”: start of string (anchor) “^”, one or more digit characters (character class and quantifier) “\d+”, end of string (anchor) “$”; the pattern represents any positive integer (it also accepts 0).

“^-?\d+$” represents any positive or negative integer.

“^-\d*\.?\d+$” represents any negative number.

# In R code, the backslash has to be escaped: "\d" becomes "\\d"
grepl("^\\d+$", seq(-10, 10, 1))

grepl("^-?\\d+$", seq(-10, 10, 1))

grepl("^-\\d*\\.?\\d+$", seq(-10, 10, 1))

“^[a-z0-9_-]{3,16}$” will match strings between 3 and 16 characters in length, containing only lower-case letters, numbers, dashes and underscores.
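A pattern of this kind is typically used to validate identifiers such as user names. A minimal sketch, assuming the intended pattern is `^[a-z0-9_-]{3,16}$` (reconstructed from the description) and using made-up candidate names:

```r
# Hypothetical candidates: too short, valid, valid, upper case, too long
candidates <- c("ab", "john_doe", "my-user-01", "Has_Upper",
                "this_name_is_far_too_long")

# Anchors force the whole string to consist of 3 to 16 allowed characters
is_valid <- grepl("^[a-z0-9_-]{3,16}$", candidates)
# FALSE TRUE TRUE FALSE FALSE
```

The dash is placed last inside the brackets so that it is read as a literal character rather than as a range operator.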

Advanced quantifiers

There are actually three different ‘classes’ of quantifiers, based on the underlying pattern evaluation paradigm:

Greedy quantifiers try to get the longest match that is possible for the specified input string.

The pattern “.*” means: repeat any character as many times as possible, up to the end of the string. Example: the pattern “.*foo” means: find anything followed by the three letters “f”, “o”, “o”.

“*” quantifier: matches between zero and unlimited times, as many times as possible, giving back as needed (greedy).

In effect, it starts from the end of the string and works backwards.

“?” Lazy (also known as reluctant or non-greedy) quantifiers try to get the shortest match. A quantifier is specified as lazy by appending a “?” to it.

“.*?” is the ‘lazy’ version of “.*”.

“+” Possessive quantifiers try once for a match: they take as many characters as possible and do not give any back. A quantifier is specified as possessive by appending a “+” to it, e.g. “.*+” is the ‘possessive’ version of “.*”. By contrast, the plain “+” quantifier matches between one and unlimited times, as many times as possible, giving back as needed.

What match(es) will be returned when the pattern “.*foo” is applied to the string “xfooxxxxxxfoo”?

This example uses the greedy quantifier .* to find “anything”, zero or more times, followed by the letters “f” “o” “o”. Because the quantifier is greedy, the .* portion of the expression first eats the entire input string, which is why the search effectively starts from the end. At this point, the overall expression cannot succeed, because the last three letters (“f” “o” “o”) have already been consumed. So the matcher slowly backs off one letter at a time until the rightmost occurrence of “foo” has been regurgitated, at which point the match succeeds and the search ends.
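The three behaviours can be compared on the same string. This is a minimal sketch that uses `perl = TRUE`, on the assumption that the possessive form needs the PCRE engine rather than R's default one:

```r
x <- "xfooxxxxxxfoo"

# Greedy: backs off from the end until the last "foo" fits
greedy <- regmatches(x, regexpr(".*foo", x, perl = TRUE))   # "xfooxxxxxxfoo"

# Lazy: stops at the first "foo" it can reach
lazy     <- regmatches(x, regexpr(".*?foo", x, perl = TRUE))        # "xfoo"
lazy_all <- regmatches(x, gregexpr(".*?foo", x, perl = TRUE))[[1]]  # "xfoo" "xxxxxxfoo"

# Possessive: ".*+" consumes the whole string and never gives "foo" back,
# so there is no match at all
possessive <- regmatches(x, regexpr(".*+foo", x, perl = TRUE))      # character(0)
```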

Package stringr

The first argument of the function is always a string or a vector of strings.

Simple functions: without the pattern argument

str_length

str_length("São Paulo")
## [1] 9
str_length(c("São Paulo", "Rio de Janeiro", 
             "Rio Grande do Norte", "Acre"))
# differs from length(), which counts the elements of the vector, not the characters
length(c("São Paulo", "Rio de Janeiro", 
             "Rio Grande do Norte", "Acre"))

str_trim

s <- c("M", "F", "F", " M", " F ", "M")
as.factor(s)

string_aparada <- str_trim(s)
as.factor(string_aparada)

str_sub

s <- c("01-Feminino", "02-Masculino", "03-Indefinido")
str_sub(s, start = 4) # take from the fourth character to the last

str_sub(s, end = 2) # take only the first two characters
## [1] "01" "02" "03"

s <- c("Feminino-01", "Masculino-02", "Indefinido-03")
str_sub(s, end = -4)
## [1] "Feminino"   "Masculino"  "Indefinido"
str_sub(s, start = -2)
## [1] "01" "02" "03"

s <- c("__SP__", "__MG__", "__RJ__")
str_sub(s, 3, 4)

str_c

variaveis <- names(mtcars)
variaveis

variaveis_explicativas <- str_c(variaveis[-1], collapse = " + ")

Functions with a pattern argument

str_detect

str_detect("sao paulo", pattern = "paulo$")
## [1] TRUE
str_detect("sao paulo sp", pattern = "paulo$")

Regex

‘^ban’ matches only strings that start exactly with “ban”.

‘b ?an’ matches anything that contains “ban”, with or without a space between the “b” and the “a”.

‘ban’ matches anything that contains “ban”, but it is case-sensitive.

‘BAN’ matches anything that contains “BAN”, but it is case-sensitive.

‘ban$’ matches only strings that end exactly with “ban”.

string <- c("abandonado", "ban", "banana", "BANANA", "ele levou ban", "pranab anderson")

grepl(pattern = "^ban", string)
grepl(pattern = "b ?an", string)
grepl(pattern = "b?an", string)
grepl(pattern = "ban", string) # matches anywhere in the string
grepl(pattern = "ban$", string)
grepl(pattern = "ban.$", string)
grepl(pattern = "ban.?$", string)
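Since the patterns above are case-sensitive, “BANANA” is only found when case is ignored. A minimal sketch using the `ignore.case` argument of base R's grepl on the same vector:

```r
string <- c("abandonado", "ban", "banana", "BANANA", "ele levou ban", "pranab anderson")

case_sensitive   <- grepl("ban", string)                      # TRUE TRUE TRUE FALSE TRUE FALSE
case_insensitive <- grepl("ban", string, ignore.case = TRUE)  # TRUE TRUE TRUE TRUE TRUE FALSE

# The optional space lets "pranab anderson" match through "b an"
optional_space <- grepl("b ?an", string)                      # TRUE TRUE TRUE FALSE TRUE TRUE
```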

My Samples

d <- structure(list(value = c("           2019s/v282930ahead of print        ", 
"           2018s/v252627         ", "           2017s/v222324         ", 
"           2016s/v192021         ", "           2015s/v161718         ", 
"           2014s/v131415         ", "           2013s/v101112         ", 
"           2012s/v789         ", "           2011s/v56          "
)), row.names = c(NA, -9L), class = c("tbl_df", "tbl", "data.frame"
))

With str_extract and regex

d %>%
  mutate(value = str_trim(value), 
         year = str_extract(value, "\\d{4}"),
         Number_1 = str_extract(value, "(?<=s/v)\\d{2}"),
         Number_2 = str_extract(value, "(?<=s/v\\d{2})\\d{2}"),
         Number_3 = str_extract(value, "(?<=s/v\\d{4})\\d{2}"))

With str_sub

d %>%
  mutate(value = str_trim(value),
         year = str_sub(value, end = 4),
         Number_1 = str_sub(value, start = 8, end = 9),
         Number_2 = str_sub(value, start = 10, end = 11),
         Number_3 = str_sub(value, start = 12, end = 13))
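A third option is to capture all four pieces with a single pattern and one group per piece. This is a minimal base R sketch on the first two values only, offered as an alternative under the assumption that every value has three two-digit numbers; shorter rows such as “2012s/v789” would not match this pattern.

```r
# First two values from the sample data, trimmed with base R's trimws()
value <- trimws(c("           2019s/v282930ahead of print        ",
                  "           2018s/v252627         "))

# One capture group for the year and one for each two-digit number
m     <- regmatches(value, regexec("^(\\d{4})s/v(\\d{2})(\\d{2})(\\d{2})", value))
parts <- do.call(rbind, m)  # one row per value; column 1 is the full match

year     <- parts[, 2]  # "2019" "2018"
number_1 <- parts[, 3]  # "28" "25"
number_2 <- parts[, 4]  # "29" "26"
number_3 <- parts[, 5]  # "30" "27"
```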