Main functions for Regular Expressions in R

This is a summary of the main functions used with regular expressions.

1. `grep()` and `grepl()`

Both functions search for matches of a regular expression (pattern) in a character vector.

- grep() returns the indices of the elements of the character vector that match; with `value = TRUE` it returns the matching strings themselves.

prueba <- c("a", "a", "a", "b", "b", "c", "c")
grep("a", prueba)

## [1] 1 2 3

# finds which elements of the vector 'prueba' match the pattern 'a' and
# returns their indices

length(grep("a", prueba))

## [1] 3

# grep() can be wrapped in other functions: here length() shows that three
# elements of the 'prueba' vector match the pattern 'a'

- grepl() returns a TRUE/FALSE vector of the same length as the original vector, indicating which elements match.

grepl("a", prueba)

## [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
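As a small sketch of how these two functions are typically used together with subsetting (the variable names `matched_values`, `subset_by_logical` and `n_matches` are only illustrative), the example below returns the matching strings instead of their indices and counts the matches from the logical vector:

```r
prueba <- c("a", "a", "a", "b", "b", "c", "c")

# value = TRUE makes grep() return the matching strings instead of indices
matched_values <- grep("a", prueba, value = TRUE)
print(matched_values)
## [1] "a" "a" "a"

# the logical vector from grepl() can be used directly for subsetting
subset_by_logical <- prueba[grepl("a", prueba)]
print(subset_by_logical)
## [1] "a" "a" "a"

# sum() counts the TRUE values, i.e. the number of matching elements
n_matches <- sum(grepl("a", prueba))
print(n_matches)
## [1] 3
```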

- setdiff(x, y) compares two vectors and returns the values of x that are not in y.

prueba2 <- c("a", "a", "z", "x", "b", "c", "c")  # the 3rd and 4th elements are changed
setdiff(prueba, prueba2)  # takes the first vector 'prueba' and removes every value that also appears in the second vector 'prueba2'

## character(0)

# all the elements in prueba are also in prueba2
setdiff(prueba2, prueba)

## [1] "z" "x"

# the 3rd and 4th values of 'prueba2' (z and x) are not in 'prueba'

# e.g. because grep() returns a vector of indices, applying setdiff() to
# the results of grep() tells us which of those indices (positions)
# differ
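The idea in the comment above can be sketched as follows, assuming we want the positions that match 'a' in `prueba` but not in `prueba2` (the names `idx1`, `idx2` and `only_in_first` are illustrative):

```r
prueba <- c("a", "a", "a", "b", "b", "c", "c")
prueba2 <- c("a", "a", "z", "x", "b", "c", "c")

idx1 <- grep("a", prueba)   # positions 1 2 3
idx2 <- grep("a", prueba2)  # positions 1 2

# positions where 'prueba' matches 'a' but 'prueba2' does not
only_in_first <- setdiff(idx1, idx2)
print(only_in_first)
## [1] 3
```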

Neither grep() nor grepl() tells you exactly where within a string the match occurs or what the matched text is (which matters with more complicated strings).

2. `regexpr()` and `gregexpr()`

Both functions search a character vector for regular expression matches and return the position within each string where the match begins and the length of the match.

- regexpr() only gives you the first match in each element of the vector.

prueba3 <- c("aaa", "adc", "baa", "ahs", "jdn", "ccc")
regexpr("a", prueba3)

## [1]  1  1  2  1 -1 -1
## attr(,"match.length")
## [1]  1  1  1  1 -1 -1
## attr(,"useBytes")
## [1] TRUE

# the first vector gives the position of the match (where the 'a' is) for
# every element of the vector; the second vector gives the length of
# the match (remember that regexpr() only looks for the first match in
# each element, and -1 means no match)

- gregexpr() returns the positions and lengths of all the matches in each element of the character vector. Regular expressions are greedy by default: a pattern matches as much text as it can (for example, if the pattern ends with an 'a' and there is another 'a' further along in the text that still lets the pattern match, the longest possible match is returned). In these cases it is necessary to make the pattern lazy, writing `.*?` instead of `.*`.

gregexpr("a", prueba3)

## [[1]]
## [1] 1 2 3
## attr(,"match.length")
## [1] 1 1 1
## attr(,"useBytes")
## [1] TRUE
## 
## [[2]]
## [1] 1
## attr(,"match.length")
## [1] 1
## attr(,"useBytes")
## [1] TRUE
## 
## [[3]]
## [1] 2 3
## attr(,"match.length")
## [1] 1 1
## attr(,"useBytes")
## [1] TRUE
## 
## [[4]]
## [1] 1
## attr(,"match.length")
## [1] 1
## attr(,"useBytes")
## [1] TRUE
## 
## [[5]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE
## 
## [[6]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE

# for each element of the vector, it returns the positions of all the
# matches and their lengths
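To illustrate the greedy versus lazy behaviour mentioned above, here is a minimal sketch on a made-up string (`texto` is hypothetical sample data, not part of the original example) that compares the match lengths reported by gregexpr():

```r
texto <- "xaxxa"  # hypothetical sample string

greedy <- gregexpr("x.*a", texto)[[1]]   # .* takes as much as it can
lazy <- gregexpr("x.*?a", texto)[[1]]    # .*? takes as little as it can

greedy_len <- attr(greedy, "match.length")
lazy_len <- attr(lazy, "match.length")

print(greedy_len)  # one match covering the whole string
## [1] 5
print(lazy_len)    # two shorter matches: "xa" and "xxa"
## [1] 2 3
```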

- substr(): knowing the start position and the length of a match, you can extract it with the substr() function, which takes a string, the position of the first character to extract (`start`) and the position of the last one (`stop`).

substr(prueba3[1], 1, 3)  # in R we start counting at 1, and the character at the 3rd argument `stop` is included

## [1] "aaa"

If we had a match of length three:

regexpr("aaa", prueba3)

## [1]  1 -1 -1 -1 -1 -1
## attr(,"match.length")
## [1]  3 -1 -1 -1 -1 -1
## attr(,"useBytes")
## [1] TRUE

substr(prueba3[1], 1, 1 + 3 - 1)  # from the first element, take the substring that starts at character 1 and stops at start + match length - 1; the start character is itself extracted, so without the -1 we would ask for 4 characters

## [1] "aaa"
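Rather than typing the start and the length by hand, both can be read from the object returned by regexpr(). The following is a minimal sketch, assuming the pattern `a+` (first run of one or more 'a') and illustrative variable names:

```r
prueba3 <- c("aaa", "adc", "baa", "ahs", "jdn", "ccc")

m <- regexpr("a+", prueba3)      # first run of one or more 'a'
start <- as.integer(m)           # -1 where there is no match
len <- attr(m, "match.length")   # length of each match

found <- start > 0               # keep only the elements with a match
extracted <- substr(prueba3[found], start[found], start[found] + len[found] - 1)
print(extracted)
## [1] "aaa" "a"   "aa"  "a"
```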

- regmatches()

The result is similar to substr(): it gives you the text that matches the pattern, but you do not have to specify where to start or where to stop; you just pass the object returned by regexpr() (or gregexpr()) as the second argument.

a <- regexpr("aaa", prueba3)
regmatches(prueba3, a)

## [1] "aaa"

c <- gregexpr("c", prueba3)
regmatches(prueba3, c)

## [[1]]
## character(0)
## 
## [[2]]
## [1] "c"
## 
## [[3]]
## character(0)
## 
## [[4]]
## character(0)
## 
## [[5]]
## character(0)
## 
## [[6]]
## [1] "c" "c" "c"
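Note that with a regexpr() result, regmatches() returns only the elements that matched, so the output can be shorter than the input. One possible way to keep the result aligned with the original vector (a sketch with illustrative names, not the only approach) is to fill in the matches by position:

```r
prueba3 <- c("aaa", "adc", "baa", "ahs", "jdn", "ccc")

m <- regexpr("a+", prueba3)

# start with NA everywhere, then fill in the elements that have a match
first_match <- rep(NA_character_, length(prueba3))
first_match[as.integer(m) > 0] <- regmatches(prueba3, m)
print(first_match)
## [1] "aaa" "a"   "aa"  "a"   NA    NA
```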

3. `regexec()`

It works like regexpr(), except that it also gives you the positions and lengths for the parenthesized sub-expressions of the regular expression (which is often what you actually want).

prueba4 <- "his salary is 50,000 dollars per year"
regexec("his salary is (.*) dollars per year", prueba4)

## [[1]]
## [1]  1 15
## attr(,"match.length")
## [1] 37  6

# the match for the complete regular expression starts at the 1st
# character and is 37 characters long. the match for the parenthesized
# sub-expression `(.*)` starts at the 15th character and is 6 characters
# long.

regexec("his salary is .* dollars per year", prueba4)

## [[1]]
## [1] 1
## attr(,"match.length")
## [1] 37

# if we take out the parentheses, we only get where the match for the
# entire regular expression starts and its length

It can also be combined with the substr() and regmatches() functions:

substr(prueba4, 1, 1 + 37 - 1)  # returns the entire match

## [1] "his salary is 50,000 dollars per year"

substr(prueba4, 15, 15 + 6 - 1)  # returns only the salary, using the start position and length of the parenthesized sub-expression

## [1] "50,000"

a <- regexec("his salary is (.*) dollars per year", prueba4)
b <- regmatches(prueba4, a)  # gives a list with one character vector per input string; here it holds the whole match first and then the 'salary' sub-expression
salary <- sapply(b, function(x) x[2])  # use `sapply()` to go through the elements of the list and extract the 2nd element of each
print(salary)

## [1] "50,000"
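The same approach extends to several strings and more than one parenthesized sub-expression. The sketch below uses made-up sentences (`frases` is hypothetical sample data) and picks the second sub-expression, which is the 3rd element of each match vector:

```r
# hypothetical sample data: two matching sentences and one that does not match
frases <- c("his salary is 50,000 dollars per year",
            "her salary is 62,500 dollars per year",
            "no salary information")

m <- regexec("(his|her) salary is (.*) dollars per year", frases)
parts <- regmatches(frases, m)

# each element holds: whole match, 1st sub-expression, 2nd sub-expression;
# strings without a match give character(0), so x[3] is NA for them
salaries <- sapply(parts, function(x) x[3])
print(salaries)
## [1] "50,000" "62,500" NA
```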

4. `sub()` and `gsub()`

Both functions replace one string with another in a given vector.

- sub() replaces only the first match it finds in each element of the vector

sub("a", "2", prueba3)  # finds the first 'a' in each element of the vector 'prueba3' and replaces it with '2'

## [1] "2aa" "2dc" "b2a" "2hs" "jdn" "ccc"

- gsub() replaces all the matches it finds in each element of the vector

gsub("a", "2", prueba3)  # finds every 'a' in each element of the vector 'prueba3' and replaces it with '2'

## [1] "222" "2dc" "b22" "2hs" "jdn" "ccc"
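The replacement string can also refer back to a parenthesized sub-expression with `\\1`, which gives another possible way to pull out the salary from the earlier example. This is only a sketch; the names `salary_text` and `salary_number` are illustrative:

```r
prueba4 <- "his salary is 50,000 dollars per year"

# replace the whole match with the text captured by the first (.*)
salary_text <- sub("his salary is (.*) dollars per year", "\\1", prueba4)
print(salary_text)
## [1] "50,000"

# remove every comma with gsub() and convert the result to a number
salary_number <- as.numeric(gsub(",", "", salary_text))
print(salary_number)
## [1] 50000
```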
