Regular expressions (RE) are useful for searching text with general patterns. They combine metacharacters, which act as the ‘grammar’ of the search logic, with literals, which are like the ‘words’ of a natural language (NL).
# use ^Char s to find "Char s" at the start of a line
# use Char$ to find "Char" at the end of a line
# use [Cc][Hh][Aa][Rr] to find "CHAR", "char", "Char", ... all case variants
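The anchors and character classes above can be tried out with a short sketch. It assumes Python's `re` module as the regex engine (the notes do not prescribe one for these patterns), and the sample lines are invented for illustration.

```python
import re

# Invented sample lines, for illustration only
lines = [
    "Char s is first",
    "the last word is Char",
    "a CHAR in the middle",
    "nothing to see",
]

# ^ anchors the pattern to the start of the line
at_start = [s for s in lines if re.search(r"^Char s", s)]
# $ anchors the pattern to the end of the line
at_end = [s for s in lines if re.search(r"Char$", s)]
# one character class per letter accepts every capitalization
any_case = [s for s in lines if re.search(r"[Cc][Hh][Aa][Rr]", s)]

print(at_start)  # ['Char s is first']
print(at_end)    # ['the last word is Char']
print(any_case)  # the first three lines
```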
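# Combining RE
# use ^[Cc][Hh][Aa][Rr] s to find all case variants of "char s" at the beginning of a line
# use [0-9][a-zA-Z] to specify ranges: one digit followed by one letter
# use [^?.]$ to match a line ending in a character that is not in the indicated class, i.e. neither "?" nor "."; a "^" at the start of a class negates it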
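A minimal sketch of the combined patterns, again assuming Python's `re` module; the variable names and test strings are hypothetical.

```python
import re

# Anchor plus per-letter classes: any capitalization of "char s" at line start
starts_any_case = re.compile(r"^[Cc][Hh][Aa][Rr] s")
# Ranges: one digit immediately followed by one letter
digit_then_letter = re.compile(r"[0-9][a-zA-Z]")
# Negated class: the last character is neither "?" nor "."
no_q_or_dot_at_end = re.compile(r"[^?.]$")

print(bool(starts_any_case.search("cHaR s comes first")))  # True
print(bool(starts_any_case.search("a char s later on")))   # False
print(bool(digit_then_letter.search("room 7b")))           # True
print(bool(digit_then_letter.search("room 7 b")))          # False
print(bool(no_q_or_dot_at_end.search("Done!")))            # True
print(bool(no_q_or_dot_at_end.search("Done.")))            # False
```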
# ".", the dot, matches any single character
# "|", the pipe, is the same as "or", so
# flood|fire will match either "flood" or "fire"
# ^[Gg]ood|[Bb]ad will match "Good" or "good" at the beginning of the line, or
# "Bad" or "bad" anywhere in the line.
# ^([Gg]ood|[Bb]ad) will match "Good", "good", "Bad" or "bad" at the beginning of the line.
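The difference between anchoring one alternative and anchoring the whole group is easy to miss, so here is a small sketch. It assumes Python's `re` module, spells the patterns with an explicit `|` between the alternatives, and uses invented test strings.

```python
import re

# "." stands for any single character
dot_hits = re.findall(r"f.re", "fire, fore, fre")
print(dot_hits)  # ['fire', 'fore']

# "|" means "or"
either = re.findall(r"flood|fire", "fire after the flood")
print(either)  # ['fire', 'flood']

# Without parentheses, ^ applies only to the first alternative
ungrouped = r"^[Gg]ood|[Bb]ad"
# With parentheses, ^ applies to both alternatives
grouped = r"^([Gg]ood|[Bb]ad)"

text = "not so bad"
print(bool(re.search(ungrouped, text)))  # True: "bad" may appear anywhere
print(bool(re.search(grouped, text)))    # False: "bad" is not at the start
```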
# [Oo]ption( [Aa][Ll]\.)? will match "Option" or "option", optionally followed by " al.", " AL.", " Al." or " aL."
# here "\", the backslash, is used to escape "." so that it is a literal dot, and "?", the question mark, makes the preceding group optional
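A hedged sketch of the optional group and the escaped dot, assuming Python's `re` module. The pattern is written here as `[Oo]ption( [Aa][Ll]\.)?`, an assumed spelling in which the first class accepts both capitalizations of "O"; the test strings are invented.

```python
import re

# ( ... )? makes the whole group optional; \. is a literal dot
pattern = r"[Oo]ption( [Aa][Ll]\.)?"

with_suffix = re.search(pattern, "option al. three").group(0)
without_suffix = re.search(pattern, "Option three").group(0)
# " al" without the literal dot does not satisfy the group, so it is skipped
no_dot = re.search(pattern, "option alx").group(0)

print(with_suffix)     # option al.
print(without_suffix)  # Option
print(no_dot)          # option
```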
# (.*), any character repeated any number of times (including zero), within "()" parentheses
# [0-9]+ (.*)[0-9]+, the "+" plus sign means at least one of the item (a digit in this case), so this matches at least one digit, a space, any number of characters, and then at least one digit
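To see what `+` and `(.*)` actually consume, the following sketch assumes Python's `re` module and an invented sentence. `group(1)` is the text captured by the parentheses.

```python
import re

pattern = r"[0-9]+ (.*)[0-9]+"

m = re.search(pattern, "we worked 12 hours on 3 of 7 days")
whole = m.group(0)   # everything the pattern matched
inside = m.group(1)  # what (.*) captured

print(repr(whole))   # '12 hours on 3 of 7'
print(repr(inside))  # 'hours on 3 of '

# (.*) may also match nothing at all
empty_inside = re.search(pattern, "4 2").group(1)
print(repr(empty_inside))  # ''
```

Because `.*` takes as much as it can, the match runs to the last digit in the line rather than the next one.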
# [Tt]rump( +[^ ]+ +){1,5} tweet will find "Trump" or "trump" anywhere in a line, followed by a group (at least one space, something that is not a space, at least one space) repeated between one and five times, followed by " tweet"
# {m,n} means between m and n matches, {m,} at least m matches, {m} exactly m matches
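A small sketch of the repetition counts, assuming Python's `re` module. The phrase pattern is written with the quantifier after the closing parenthesis, `( +[^ ]+ +){1,5}`, which is an assumed reading of the note above; the test strings are invented.

```python
import re

# {m} exactly m, {m,} at least m, {m,n} between m and n repetitions
exactly_three = re.fullmatch(r"[0-9]{3}", "123") is not None
at_least_two = re.fullmatch(r"[0-9]{2,}", "12345") is not None
one_to_five = re.fullmatch(r"[0-9]{1,5}", "123456") is not None

print(exactly_three, at_least_two, one_to_five)  # True True False

phrase = r"[Tt]rump( +[^ ]+ +){1,5} tweet"

# Each repetition needs its own space on both sides, so this
# double-spaced line matches with three repetitions of the group
double_spaced = re.search(phrase, "Trump sent  an  angry  tweet")
# With single spaces the group and the next item compete for the same space
single_spaced = re.search(phrase, "Trump sent an angry tweet")

print(double_spaced is not None)  # True
print(single_spaced is not None)  # False
```

As written, the group consumes a space on each side of every word, which is why the single-spaced line is not matched in this sketch.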
# "\1" is used to match again whatever the first "()" parentheses matched, so
# " +([a-zA-Z]+) +\1 +" will match at least one space, followed by one or more letters, followed by at least one space, followed by the exact same text found by the "([a-zA-Z]+)" group, followed by at least one space, so it will find repeated words.
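The backreference can be sketched as follows, assuming Python's `re` module (which also writes the reference as `\1`); the sentences are invented.

```python
import re

# \1 must repeat exactly what the first group captured
pattern = r" +([a-zA-Z]+) +\1 +"

m = re.search(pattern, "it was the the best night")
repeated_word = m.group(1)
print(repeated_word)        # the
print(repr(m.group(0)))     # ' the the '

# No word is repeated here, so nothing matches
no_repeat = re.search(pattern, "it was the best night")
print(no_repeat)            # None
```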
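# "*" is greedy, so
# "^s(.*)s" will match the longest possible string that starts with "s" at the beginning of the line and ends with "s"
# "^s(.*?)s$", the question mark "?" turns off the greediness of ".*", so it matches as few characters as possible; note that the trailing "$" still forces the match to run to the end of the line, whereas "^s(.*?)s" stops at the first "s" it can.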
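Greedy and non-greedy matching are easiest to compare side by side. This sketch assumes Python's `re` module and an invented line; it also includes the pattern without the trailing `$` to show where the lazy version stops on its own.

```python
import re

text = "sitting at starbucks"

# Greedy: .* runs as far as it can, up to the last "s"
greedy = re.search(r"^s(.*)s", text).group(0)
# Lazy: .*? stops at the first "s" that completes the pattern
lazy = re.search(r"^s(.*?)s", text).group(0)
# Lazy with $: the end-of-line anchor forces the match to the end again
lazy_anchored = re.search(r"^s(.*?)s$", text).group(0)

print(greedy)         # sitting at starbucks
print(lazy)           # sitting at s
print(lazy_anchored)  # sitting at starbucks
```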
Week 4 Course Notes: data-cleaning in R, by [Linda] (@lindangulopez)