Loading the stringr library
library(stringr)
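In this example, the pattern we are matching is a sequence of one or more digits in the 0-9 range (the “+” quantifier means “one or more”), immediately followed by a literal dollar sign. The dollar sign is escaped (“\\$” in the R string) because an unescaped “$” is the end-of-string anchor.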
container1 <- c("4425$", "1113", "335A", "$11132", "23 35$", " 35 $352 035$")
unlist(str_match_all(container1, "[0-9]+\\$"))
## [1] "4425$" "35$" "035$"
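To see which input strings contributed a match, a minimal sketch along the following lines may help. The vector name `amounts` is hypothetical and simply repeats the values of container1; `str_detect()` and `str_extract_all()` are standard stringr functions.

```r
library(stringr)

# Hypothetical copy of the values in container1
amounts <- c("4425$", "1113", "335A", "$11132", "23 35$", " 35 $352 035$")

# TRUE where at least one run of digits is directly followed by "$"
has_match <- str_detect(amounts, "[0-9]+\\$")
has_match
# TRUE FALSE FALSE FALSE TRUE TRUE

# Number of matches found in each input string
n_matches <- lengths(str_extract_all(amounts, "[0-9]+\\$"))
n_matches
# 1 0 0 0 1 1
```

Strings such as "$11132" are skipped because the dollar sign comes before the digits, not after them.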
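In this example, the pattern we are matching is a sequence of lower case letters enclosed by word boundaries (“\\b”). The “{1,4}” quantifier requires the letter class to match at least one time and no more than four times, so only whole words of one to four lower case letters are extracted.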
container2 <- c("hello", "Firs", "plot", "pear", "AAAA")
unlist(str_extract_all(container2, "\\b[a-z]{1,4}\\b"))
## [1] "plot" "pear"
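The role of the word boundaries can be illustrated by running the same character class without them. This is only a sketch; the vector name `words` is hypothetical and repeats the values of container2.

```r
library(stringr)

# Hypothetical copy of the values in container2
words <- c("hello", "Firs", "plot", "pear", "AAAA")

# With word boundaries: only whole words of 1 to 4 lower case letters
with_boundaries <- unlist(str_extract_all(words, "\\b[a-z]{1,4}\\b"))
with_boundaries
# "plot" "pear"

# Without word boundaries: fragments of longer or mixed-case words also match
without_boundaries <- unlist(str_extract_all(words, "[a-z]{1,4}"))
without_boundaries
# "hell" "o" "irs" "plot" "pear"
```

Without “\\b”, "hello" is split into "hell" and "o", and the lower case tail of "Firs" is picked up as "irs".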
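In this example, the pattern we are matching is any sequence of characters, possibly empty, followed by a literal “.txt” at the end of the string. The “$” anchors the match to the end of the string, and the dot is escaped so that it matches only a period.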
container3 <- c("fileloc.txt", ".txt", "txt", "location .txt", ".text location", "the lost arc of.txt")
unlist(str_extract_all(container3, ".*?\\.txt$"))
## [1] "fileloc.txt" ".txt" "location .txt"
## [4] "the lost arc of.txt"
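The effect of escaping the dot and of the “$” anchor can be checked with a few extra file names. The names in `files` are made up for illustration and are not part of the original example.

```r
library(stringr)

# Hypothetical file names chosen to probe the pattern
files <- c("a.txt", "mytxt", "notes.txt.bak")

# Escaped dot: requires a literal ".txt" at the end of the string
escaped <- str_detect(files, ".*?\\.txt$")
escaped
# TRUE FALSE FALSE

# Unescaped dot: "." matches any character, so "mytxt" also passes
unescaped <- str_detect(files, ".*?.txt$")
unescaped
# TRUE TRUE FALSE
```

In both cases "notes.txt.bak" is rejected because the anchor requires the string to end right after "txt".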
In this example, the pattern we are matching is: a) two digits followed by a forward slash, b) another two digits followed by a forward slash, and c) four digits. This is accomplished by using the digit symbol “\\d” and the “{n}” quantifier. This pattern will only match strings with that very specific structure, including the forward slashes.
container4 <- c("01/12/1598", "01\\12\\1598", "01/12/1999", "1992/12/19", "02.12.2001")
unlist(str_extract_all(container4, "\\d{2}/\\d{2}/\\d{4}"))
## [1] "01/12/1598" "01/12/1999"
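Because the pattern is not anchored, it can also match inside a longer run of digits. The sketch below, using a made-up vector `dates`, shows one way to require that the whole string has the date structure by adding “^” and “$”.

```r
library(stringr)

# Hypothetical inputs: a well-formed date and one with extra digits on both sides
dates <- c("01/12/1598", "101/12/15989")

# Unanchored: the date-shaped part inside the second string is still extracted
partial <- str_extract(dates, "\\d{2}/\\d{2}/\\d{4}")
partial
# "01/12/1598" "01/12/1598"

# Anchored: the entire string must be two digits, two digits, four digits
whole <- str_detect(dates, "^\\d{2}/\\d{2}/\\d{4}$")
whole
# TRUE FALSE
```

Whether the anchored or unanchored form is appropriate depends on whether the dates are expected to appear alone or embedded in longer text.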
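In this example, the pattern we are matching is an opening tag, some content, and a closing tag with the same name. The tag name is captured by the group “(.+?)” and reused in the closing tag through the backreference “\\1”, so the last element, whose closing tag does not match its opening tag, is not extracted.
container5 <- c("<test> hello </test>", "<body> header and print </body>", "<test> .. </test>", "<test> hello </test1>")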
unlist(str_extract_all(container5, "<(.+?)>.+?</\\1>"))
## [1] "<test> hello </test>" "<body> header and print </body>"
## [3] "<test> .. </test>"
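To inspect what the capture group holds, `str_match()` can be used: it returns a matrix whose first column is the full match and whose following columns are the capture groups. The sketch below uses a hypothetical vector `tags` with made-up tag names.

```r
library(stringr)

# Hypothetical inputs: two well-formed pairs and one mismatched pair
tags <- c("<p> hello </p>", "<body> header and print </body>", "<p> hello </div>")

# Column 1 is the full match, column 2 is the text captured by (.+?)
m <- str_match(tags, "<(.+?)>.+?</\\1>")

# The captured tag name, or NA where the closing tag does not match
tag_names <- m[, 2]
tag_names
# "p" "body" NA
```

The NA in the last position shows that the backreference rejected the mismatched closing tag.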
The first step is to replace the [0-9] range with the predefined character class [[:digit:]]. Then replace the “+” quantifier with the “*” quantifier. On the right-hand side, we can replace the escaped literal “\\$” with a word boundary “\\b” followed by a character class that contains only $, “[$]”. Because “\\b” requires a word character directly before the dollar sign, this gives the same output as the first example for container1.
unlist(str_match_all(container1, "[[:digit:]]*\\b[$]"))
## [1] "4425$" "35$" "035$"
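The two expressions agree on container1, but they are not equivalent in general. As a hedged illustration, the made-up string below has a letter directly before the dollar sign: the word boundary is satisfied by the letter, and “*” lets the digit class match nothing.

```r
library(stringr)

# Hypothetical input with a letter, not a digit, directly before "$"
tricky <- "335A$"

# Original pattern: needs at least one digit directly before "$", so no match
plus_version <- unlist(str_extract_all(tricky, "[0-9]+\\$"))
length(plus_version)
# 0

# Rewritten pattern: zero digits, a boundary between "A" and "$", then "$"
star_version <- unlist(str_extract_all(tricky, "[[:digit:]]*\\b[$]"))
star_version
# "$"
```

If inputs like this are possible, keeping a quantifier that requires at least one digit is the safer choice.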
# Step 1: store the obfuscated email address
email <- "chunkylover53[at]aol[dot].com"
# Step 2: replace the literal "[at]" with "@" (the brackets are escaped)
email <- str_replace(email, pattern = "\\[at\\]", replacement = "@")
# Print the result of the first replacement
email
## [1] "chunkylover53@aol[dot].com"
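# Step 3: replace "[dot]." with "."; the trailing dot is escaped so it matches only a literal period
email <- str_replace(email, pattern = "\\[dot\\]\\.", replacement = ".")
# Print the result of the second replacement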
email
## [1] "chunkylover53@aol.com"
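The two replacement steps can also be combined into one call, since `str_replace_all()` accepts a named vector of the form c(pattern = replacement). This is a sketch of an alternative, not the approach used above; the names `obfuscated` and `cleaned` are hypothetical.

```r
library(stringr)

# Hypothetical copy of the obfuscated address
obfuscated <- "chunkylover53[at]aol[dot].com"

# Each name is a regular expression, each value is its replacement
cleaned <- str_replace_all(
  obfuscated,
  c("\\[at\\]" = "@", "\\[dot\\]\\." = ".")
)
cleaned
# "chunkylover53@aol.com"
```

Unlike `str_replace()`, this replaces every occurrence of each pattern, which only matters if the placeholders can appear more than once.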
email <- "chunkylover53[at]aol[dot].com"
# The first attempt, using the expression below, fails because "[[:digit:]]" matches exactly one digit at a time. Each digit is returned as a separate match, so the output does not tell us whether the digits were adjacent: the five could have been at the front and the three at the end of the email address.
ext <- str_extract_all(email, "[[:digit:]]")
ext
## [[1]]
## [1] "5" "3"
# To fix the issue, we add the "+" quantifier so that one or more consecutive digits are matched together as a single sequence.
ext <- str_extract_all(email, "[[:digit:]]+")
ext
## [[1]]
## [1] "53"
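When only the first run of digits is needed, `str_extract()` returns a plain character vector instead of a list, which is convenient for converting the result to a number. The variable names below are hypothetical.

```r
library(stringr)

# Hypothetical copy of the obfuscated address
address <- "chunkylover53[at]aol[dot].com"

# First run of one or more consecutive digits, as a character string
first_run <- str_extract(address, "[[:digit:]]+")
first_run
# "53"

# Converted to an integer for numeric use
as_number <- as.integer(first_run)
as_number
# 53
```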
email <- "chunkylover53[at]aol[dot].com"
# This expression fails because the upper case "\D" is the symbol for non-digits, so it extracts every character that is not a digit.
ext <- str_extract_all(email, "\\D")
ext
## [[1]]
## [1] "c" "h" "u" "n" "k" "y" "l" "o" "v" "e" "r" "[" "a" "t" "]" "a" "o"
## [18] "l" "[" "d" "o" "t" "]" "." "c" "o" "m"
# To fix this issue, we use the lower case symbol "\d" with the "+" quantifier, for the same reasons as in the previous example. This gives us the correct output.
ext <- str_extract_all(email, "\\d+")
ext
## [[1]]
## [1] "53"
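For contrast, adding the “+” quantifier to the upper case symbol groups the non-digit characters into runs, which makes the complementary relationship between the two symbols easier to see. This is an illustrative sketch with hypothetical variable names.

```r
library(stringr)

# Hypothetical copy of the obfuscated address
address <- "chunkylover53[at]aol[dot].com"

# Runs of consecutive non-digit characters: everything around the digits
non_digit_runs <- unlist(str_extract_all(address, "\\D+"))
non_digit_runs
# "chunkylover" "[at]aol[dot].com"

# Runs of consecutive digits: the part that the non-digit runs leave out
digit_runs <- unlist(str_extract_all(address, "\\d+"))
digit_runs
# "53"
```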