Source file ⇒ lec21.Rmd
Coming attractions:
A regular expression (regex) is a pattern that describes a set of strings.
example: The regex “[a-cx-z]” matches “a”, “b”, “c”, “x”, “y”, “z”.
example: The regex “ba+d” matches “bad”, “baad”, “baaad”, etc., but not “bd”.
Regular expressions are used for several purposes:
to detect whether a pattern is contained in a string. Use filter() and grepl().
to replace the elements of that pattern with something else. Use mutate() and gsub().
to extract a component that matches the pattern. Use extractMatches() from the DataComputing package.
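As a minimal base-R sketch of these three uses: the vector words is a made-up example, and regmatches() with regexpr() stands in for extractMatches(), which is not part of base R.

```r
# Hypothetical example strings, not part of the lecture data
words <- c("bad", "baad", "bd", "xylophone")

# Detect: TRUE where the string contains "ba+d"
has_match <- grepl("ba+d", words)

# Replace: turn every run of one or more a's into a single "o"
replaced <- gsub("a+", "o", words)

# Extract: the first character of each string that falls in [a-cx-z]
first_hit <- regmatches(words, regexpr("[a-cx-z]", words))

print(has_match)   # TRUE TRUE FALSE FALSE
print(replaced)    # "bod" "bod" "bd" "xylophone"
print(first_hit)   # "b" "b" "b" "x"
```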
Consider the baby names data, summarised to give the total count of each name for each sex.
NameList <- BabyNames %>%
  mutate( name=tolower(name) ) %>%
  group_by( name, sex ) %>%
  summarise( total=sum(count) ) %>%
  arrange( desc(total))
In the examples that follow, the regular expression is the string in quotes. grepl() is a function that compares a regular expression to a string, returning TRUE if the string contains a match and FALSE otherwise.
The name contains “shine”, as in “sunshine” or “moonshine”
NameList %>%
  filter( grepl( "shine", name ) ) %>%
  head()
name | sex | total |
---|---|---|
rashine | M | 11 |
shine | F | 95 |
shine | M | 44 |
shinea | F | 17 |
shinead | F | 19 |
shinece | F | 7 |
The name contains three or more vowels in a row.
NameList %>%
  filter( grepl( "[aeiou]{3,}", name ) ) %>%
  head()
name | sex | total |
---|---|---|
aaidan | M | 38 |
aaiden | M | 472 |
aaidyn | M | 29 |
aaila | F | 6 |
aailiyah | F | 10 |
aailyah | F | 163 |
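The name contains three or more consonants in a row (here, any characters other than a, e, i, o, u).
NameList %>%
  filter( grepl( "[^aeiou]{3,}", name ) ) %>%
  head()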
name | sex | total |
---|---|---|
aadarsh | M | 140 |
aadhya | F | 390 |
aadhyan | M | 5 |
aadyn | M | 354 |
aadyn | F | 16 |
aaidyn | M | 29 |
The name contains “mn”.
NameList %>%
  filter( grepl( "mn", name ) ) %>%
  head()
name | sex | total |
---|---|---|
aamna | F | 181 |
amna | F | 1099 |
amnah | F | 95 |
amneet | F | 6 |
amneh | F | 28 |
amner | M | 33 |
The name has consonants in the first, third, and fifth positions.
NameList %>%
  filter( grepl( "^[^aeiou].[^aeiou].[^aeiou]", name ) ) %>%
  head()
name | sex | total |
---|---|---|
babacar | M | 124 |
babajide | M | 44 |
babak | M | 287 |
babara | F | 426 |
babatunde | M | 344 |
babby | F | 34 |
The name ends in a vowel; totals are given for each sex.
NameList %>%
  filter( grepl( "[aeiou]$", name ) ) %>%
  group_by( sex ) %>%
  summarise( total=sum(total) )
sex | total |
---|---|
F | 96702371 |
M | 21054791 |
By these totals, names ending in a vowel were given to girls about 4.6 times as often as to boys (roughly 96.7 million versus 21.1 million).
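The comparison can be checked with a line of arithmetic. This is only a sketch: the two totals are copied by hand from the table above, and the variable names are made up for the illustration.

```r
# Totals copied from the table above (names ending in a vowel, by sex)
ends_in_vowel <- c(F = 96702371, M = 21054791)

# Ratio of the girls' total to the boys' total
ratio <- ends_in_vowel[["F"]] / ends_in_vowel[["M"]]
print(round(ratio, 2))   # 4.59
```

This is a ratio of counts. Turning it into a ratio of proportions would also require the overall total for each sex.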
To find out which final vowels are most common, you have to extract the last vowel from the name. The extractMatches() transformation function can do this.
NameList %>%
  extractMatches( "([aeiou])$", name, vowel=1 ) %>%
  group_by( sex, vowel ) %>%
  summarise( total=sum(total) ) %>%
  arrange( sex, desc(total) )
sex | vowel | total |
---|---|---|
F | NA | 68578358 |
F | a | 56088501 |
F | e | 36432218 |
F | i | 3693024 |
F | o | 403120 |
F | u | 85508 |
M | NA | 147082250 |
M | e | 14341114 |
M | o | 4041190 |
M | a | 1844041 |
M | i | 753311 |
M | u | 75135 |
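extractMatches() comes from the DataComputing package. As a rough base-R sketch of the same idea, on a made-up vector (names_vec is hypothetical), regexpr() locates the match and regmatches() pulls it out. Names without a final vowel are left as NA, which is how the NA rows in the table above arise.

```r
# Hypothetical names, already lower-cased
names_vec <- c("emma", "noah", "leo", "james", "ali")

# Position of a vowel at the end of each string (-1 where there is none)
m <- regexpr("[aeiou]$", names_vec)
matched <- as.integer(m) != -1L

# Start with NA everywhere, then fill in the matched vowels
vowel <- rep(NA_character_, length(names_vec))
vowel[matched] <- regmatches(names_vec, m)

print(vowel)   # "a" NA "o" NA "i"
```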
Numbers often come with comma separators or unit symbols such as % or $. For instance, here is part of a table about public debt from Wikipedia.
head(Debt,3)
country | debt | percentGDP |
---|---|---|
World | $56,308 billion | 64% |
United States | $17,607 billion | 73.60% |
Japan | $9,872 billion | 214.30% |
To use these numbers for computations, they must be cleaned up.
Debt %>%
  mutate( debt=gsub("[$,%]|billion","",debt),
          percentGDP=gsub("[,%]", "", percentGDP)) %>%
  head(3)
country | debt | percentGDP |
---|---|---|
World | 56308 | 64 |
United States | 17607 | 73.60 |
Japan | 9872 | 214.30 |
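Note that gsub() returns character strings, so a conversion step is still needed before doing arithmetic. A minimal sketch, rebuilding the three rows shown above in a small data frame (the name DebtSmall is made up for this illustration):

```r
# The three rows displayed above, entered by hand
DebtSmall <- data.frame(
  country    = c("World", "United States", "Japan"),
  debt       = c("$56,308 billion", "$17,607 billion", "$9,872 billion"),
  percentGDP = c("64%", "73.60%", "214.30%"),
  stringsAsFactors = FALSE
)

# Strip the symbols with gsub(), then convert the strings to numbers
DebtSmall$debt       <- as.numeric(gsub("[$,%]|billion", "", DebtSmall$debt))
DebtSmall$percentGDP <- as.numeric(gsub("[,%]", "", DebtSmall$percentGDP))

# Numeric columns can now be used in computations
print(sum(DebtSmall$debt[-1]))   # United States + Japan, in billions
```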
# Strip currency symbols: a leading $, a trailing ¢, or €, ¥, £ anywhere
gsub("^\\$|€|¥|£|¢$","", c("$100.95", "45¢"))
## [1] "100.95" "45"
# Strip leading and trailing spaces
gsub( "^ +| +$", "", " My name is Julia ")
## [1] "My name is Julia"
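For the whitespace case specifically, base R also provides trimws(). A small sketch comparing the two approaches on the same input:

```r
s <- " My name is Julia "

# Regex approach: remove runs of spaces at the start or at the end
by_regex <- gsub("^ +| +$", "", s)

# Built-in helper that trims leading and trailing whitespace
by_trimws <- trimws(s)

print(identical(by_regex, by_trimws))   # TRUE
```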
There are simple regular expressions and complicated ones. All of them look foreign until you learn how to read them.
There are many regular expression tutorials on the Internet, for instance this interactive one. Here is a useful cheat sheet
Some basics:
A single . means “any character.”
A character, e.g., b, means just that character.
Characters enclosed in square brackets, e.g., [aeiou], mean any one of those characters. (So, [aeiou] is a pattern describing a vowel.)
A ^ as the first character inside square brackets means “any except these.” So, a consonant is [^aeiou].
Alternatives. A vertical bar means “either.” For example, (A|a)dam matches Adam and adam.
Repeats
Two simple patterns in a row mean those patterns consecutively. Example: M[aeiou] means a capital M followed by a lower-case vowel.
A simple pattern followed by a + means “one or more times.” For example, M(ab)+ means M followed by one or more ab.
A simple pattern followed by a ? means “zero or one time.”
A simple pattern followed by a * means “zero or more times.”
A simple pattern followed by {2} means “exactly two times.” Similarly, {2,5} means between two and five times, and {6,} means six times or more. For instance, [aeiou]{2} means “exactly two vowels in a row.”
Start and end of strings
^ at the beginning of a regular expression means “the start of the string.”
$ at the end means “the end of the string.”
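The rules above can be tried out on a small made-up vector (x and the result names are hypothetical, chosen only for this sketch). Each call to grepl() exercises one rule:

```r
x <- c("Adam", "adam", "Mab", "Mabab", "M", "book", "bk")

alt         <- grepl("(A|a)dam", x)    # alternatives: Adam or adam
plus        <- grepl("M(ab)+", x)      # M followed by one or more "ab"
star        <- grepl("M(ab)*", x)      # M followed by zero or more "ab"
opt         <- grepl("^bo?k$", x)      # b, an optional o, then k, and nothing else
two_vowels  <- grepl("[aeiou]{2}", x)  # two lower-case vowels in a row
starts_cons <- grepl("^[^aeiou]", x)   # first character is not a lower-case vowel

print(alt)          # TRUE TRUE FALSE FALSE FALSE FALSE FALSE
print(plus)         # FALSE FALSE TRUE TRUE FALSE FALSE FALSE
print(star)         # FALSE FALSE TRUE TRUE TRUE FALSE FALSE
print(opt)          # FALSE FALSE FALSE FALSE FALSE FALSE TRUE
print(two_vowels)   # FALSE FALSE FALSE FALSE FALSE TRUE FALSE
print(starts_cons)  # TRUE FALSE TRUE TRUE TRUE TRUE TRUE
```

Note that “Adam” counts as starting with a non-vowel here, because the capital A is not in the set [aeiou].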
Indicate which strings contain a match
Solution:
Note: you can experiment with grepl(pattern, x), for example:
x <- c("hi mabc","abc","abcd","abccd","abcabcdx","cab","abd","cad")
grepl("a[b?d]",x)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
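Every string matches above because, inside square brackets, ? is an ordinary character, so a[b?d] means “a followed by b, ?, or d.” As a sketch of the contrast, the pattern ab?d (no brackets) treats ? as “zero or one b”:

```r
x <- c("hi mabc","abc","abcd","abccd","abcabcdx","cab","abd","cad")

in_brackets <- grepl("a[b?d]", x)  # a followed by b, a literal ?, or d
as_repeat   <- grepl("ab?d", x)    # a, then an optional b, then d

print(in_brackets)  # TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
print(as_repeat)    # FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
```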
The metacharacters in regular expressions are:
. \ | ( ) [ ] { ^ $ * + ?
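To match a metacharacter literally, escape it with a backslash, which is written \\ inside an R string. Example: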
grepl("[a\\.]", c("hello.bye", "adam"))
## [1] TRUE TRUE
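A short sketch of why escaping matters, using the same two strings: an unescaped . matches any character, while an escaped one matches only a literal period. The fixed = TRUE argument of grepl() is an alternative that turns off regular-expression interpretation altogether.

```r
x <- c("hello.bye", "adam")

any_char    <- grepl(".", x)                # . is a metacharacter: any character
literal_dot <- grepl("\\.", x)              # escaped: only a real period matches
fixed_dot   <- grepl(".", x, fixed = TRUE)  # pattern treated as plain text

print(any_char)     # TRUE TRUE
print(literal_dot)  # TRUE FALSE
print(fixed_dot)    # TRUE FALSE
```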
There are also built-in character sets for commonly used collections.
Example:
x <- c("a0b", "a1b","a12b")
grepl("a[[:digit:]]b",x)
## [1] TRUE TRUE FALSE
x <- c(3,"a","b","?")  # the number 3 is coerced to the string "3"
grepl("[[:digit:]abc]",x)
## [1] TRUE TRUE TRUE FALSE
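The built-in sets combine with the repeat operators like any other bracket expression. A brief sketch, with made-up test strings; [[:alpha:]] and [[:space:]] are two other standard POSIX classes alongside [[:digit:]]:

```r
x <- c("a0b", "a1b", "a12b")

# One or more digits between a and b, so "a12b" now matches too
digits_between <- grepl("a[[:digit:]]+b", x)

# Strings made up entirely of letters
all_letters <- grepl("^[[:alpha:]]+$", c("abc", "ab1"))

# Strings containing any whitespace character
has_space <- grepl("[[:space:]]", c("no_space", "one space"))

print(digits_between)  # TRUE TRUE TRUE
print(all_letters)     # TRUE FALSE
print(has_space)       # FALSE TRUE
```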
Write a regular expression that matches “cat”, “at”, and “t”, but not “ta” or “ct”:
x <- c("cat","at","t","ta","ct")
grepl("^(ca|a)?t$",x)
## [1] TRUE TRUE TRUE FALSE FALSE
# Doesn't work: c?a?t also matches "ct"
x <- c("cat","at","t","ta","ct")
grepl("^c?a?t$",x)
## [1] TRUE TRUE TRUE FALSE TRUE
# Matches "c", then one or more "a", then "t"
x <- c("cat","caat","caaat")
grepl("^ca+t$",x)
## [1] TRUE TRUE TRUE
# Matches "dog" with any mix of upper and lower case
x <- c("dog", "Dog" , "dOg", "doG" , "DOg")
grepl("(d|D)(o|O)(g|G)",x)
## [1] TRUE TRUE TRUE TRUE TRUE
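Two variations on the last pattern, offered as a sketch rather than as the intended solution: grepl() has an ignore.case argument that avoids spelling out both cases, and adding ^ and $ stops the pattern from matching inside a longer word. The extra string “hotdog” is made up to show the difference.

```r
x <- c("dog", "Dog", "dOg", "doG", "DOg", "hotdog")

# Case-insensitive, but unanchored: also matches inside "hotdog"
anywhere <- grepl("dog", x, ignore.case = TRUE)

# Case-insensitive and anchored: the whole string must be "dog"
whole <- grepl("^dog$", x, ignore.case = TRUE)

# The same anchored match written with bracket expressions
whole_brackets <- grepl("^[Dd][Oo][Gg]$", x)

print(anywhere)        # TRUE TRUE TRUE TRUE TRUE TRUE
print(whole)           # TRUE TRUE TRUE TRUE TRUE FALSE
print(whole_brackets)  # TRUE TRUE TRUE TRUE TRUE FALSE
```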