Today

Regular expressions (chapter 16 in the book). Not on the midterm.

Regular expressions (chapter 16)

A regular expression (regex) is a pattern that describes a set of strings.

example: The regex “[a-cx-z]” matches “a”, “b”, “c”, “x”, “y”, “z”

example: The regex “ba+d” matches “bad”, “baad”, “baaad”, etc., but not “bd”
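
Both examples can be checked in R. The sketch below uses base R's grepl() on made-up test strings; the ^ and $ around the first pattern are an addition here, so that the bracketed character has to be the whole string.

```r
# Illustrative test strings (not from the course data)
letters_to_try <- c("a", "b", "c", "x", "y", "z", "d", "m")
# ^ and $ force the single bracketed character to be the entire string
in_set <- grepl("^[a-cx-z]$", letters_to_try)

words <- c("bad", "baad", "baaad", "bd")
# a+ requires at least one "a" between "b" and "d"
has_bad <- grepl("ba+d", words)

print(in_set)   # TRUE for a, b, c, x, y, z; FALSE for d and m
print(has_bad)  # TRUE TRUE TRUE FALSE
```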

Regular expressions are used for several purposes:

to detect whether a pattern is contained in a string. Use filter() and grepl()
to replace the elements of that pattern with something else. Use mutate() and gsub()
to extract a component that matches the patterns. Use extractMatches() from the DataComputing package.
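
The three uses can be tried without any packages. This is a minimal sketch in base R on a made-up vector; regmatches() with regexpr() stands in for extractMatches() here, and is a different function with a different interface, not what the DataComputing package does internally.

```r
# Made-up strings for illustration
x <- c("sunshine", "moon", "shiner")

# 1. Detect: does each string contain "shine"?
detected <- grepl("shine", x)

# 2. Replace: swap every "shine" for "glow"
replaced <- gsub("shine", "glow", x)

# 3. Extract: pull out the first vowel of each string
#    (regmatches() silently drops strings with no match; all three match here)
first_vowel <- regmatches(x, regexpr("[aeiou]", x))

print(detected)     # TRUE FALSE TRUE
print(replaced)     # "sunglow" "moon" "glowr"
print(first_vowel)  # "u" "o" "i"
```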

Examples of accomplishing tasks with regular expressions.

1. Here are some examples of patterns in names and the use of a regular expression to detect them.

Consider the baby names data, summarised to give the total count of each name for each sex.

NameList <- BabyNames %>% 
  mutate( name=tolower(name) ) %>%
  group_by( name, sex ) %>%
  summarise( total=sum(count) ) %>%
  arrange( desc(total))

The regular expression is the string in quotes. grepl() is a function that compares a regular expression to a string, returning TRUE if there’s a match, FALSE otherwise.

The name contains “shine”, as in “sunshine” or “moonshine”
```
NameList %>% 
  filter( grepl( "shine", name ) ) %>% 
  head()
```
name	sex	total
rashine	M	11
shine	F	95
shine	M	44
shinea	F	17
shinead	F	19
shinece	F	7

The name contains three or more vowels in a row.

NameList %>% 
  filter( grepl( "[aeiou]{3,}", name ) ) %>% 
  head()

name	sex	total
aaidan	M	38
aaiden	M	472
aaidyn	M	29
aaila	F	6
aailiyah	F	10
aailyah	F	163

The name contains three or more consonants in a row.

NameList %>% 
  filter( grepl( "[^aeiou]{3,}", name ) ) %>% 
  head()

name	sex	total
aadarsh	M	140
aadhya	F	390
aadhyan	M	5
aadyn	M	354
aadyn	F	16
aaidyn	M	29
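
Note that [^aeiou] means "anything that is not a lower-case vowel", so it also accepts y, digits, spaces, and capital letters. That is harmless if the names contain only lower-case letters (an assumption about the data), but a stricter pattern lists the consonants explicitly. A small sketch on made-up strings:

```r
# Made-up strings: a name, a string with spaces and a digit, and a name with "nn"
x <- c("aadyn", "a 1 b", "anna")

# Negated class: any three non-vowel characters in a row
loose <- grepl("[^aeiou]{3,}", x)

# Explicit consonant ranges (here y counts as a consonant)
strict <- grepl("[b-df-hj-np-tv-z]{3,}", x)

print(loose)   # TRUE TRUE FALSE  (" 1 " counts as non-vowels)
print(strict)  # TRUE FALSE FALSE
```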

The name contains “mn”

NameList %>% 
  filter( grepl( "mn", name ) ) %>% 
  head()

name	sex	total
aamna	F	181
amna	F	1099
amnah	F	95
amneet	F	6
amneh	F	28
amner	M	33

The first, third, and fifth letters are consonants.

NameList %>% 
  filter( grepl( "^[^aeiou].[^aeiou].[^aeiou]", name ) ) %>% 
  head()

name	sex	total
babacar	M	124
babajide	M	44
babak	M	287
babara	F	426
babatunde	M	344
babby	F	34

How often do boys’ and girls’ names end in vowels?

NameList %>%
  filter( grepl( "[aeiou]$", name ) ) %>% 
  group_by( sex ) %>% 
  summarise( total=sum(total) )

sex	total
F	96702371
M	21054791

In these totals, names ending in a vowel account for about 4.6 times as many girls as boys.

What are the most common end vowels for names?

To answer this question, you have to extract the final letter of each name when it is a vowel. The extractMatches() transformation function can do this; names that do not end in a vowel get NA.

NameList %>% 
  extractMatches( "([aeiou])$", name, vowel=1 ) %>%
  group_by( sex, vowel ) %>% 
  summarise( total=sum(total) ) %>%
  arrange( sex, desc(total) )

sex	vowel	total
F	NA	68578358
F	a	56088501
F	e	36432218
F	i	3693024
F	o	403120
F	u	85508
M	NA	147082250
M	e	14341114
M	o	4041190
M	a	1844041
M	i	753311
M	u	75135
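
The NA rows are the names that do not end in a vowel, so the table also gives the share of each sex whose name ends in a vowel. The arithmetic below simply copies the totals from the table above; the variable names are made up for this sketch.

```r
# Totals copied from the table above
f_vowel <- c(a = 56088501, e = 36432218, i = 3693024, o = 403120, u = 85508)
m_vowel <- c(e = 14341114, o = 4041190, a = 1844041, i = 753311, u = 75135)
f_other <- 68578358    # F rows with vowel = NA
m_other <- 147082250   # M rows with vowel = NA

# Share of each sex whose name ends in a vowel
f_share <- sum(f_vowel) / (sum(f_vowel) + f_other)
m_share <- sum(m_vowel) / (sum(m_vowel) + m_other)

print(round(c(F = f_share, M = m_share), 3))  # F 0.585, M 0.125
print(round(f_share / m_share, 1))            # 4.7
```

The vowel totals add up to the F and M figures from the previous query (96702371 and 21054791), which is a useful consistency check.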

2. Here are some examples of data cleaning a table downloaded from the web.

Get rid of percent signs and commas in numerals

Numbers often come with comma separators or unit symbols such as % or $. For instance, here is part of a table about public debt from Wikipedia.

head(Debt,3)

country	debt	percentGDP
World	$56,308 billion	64%
United States	$17,607 billion	73.60%
Japan	$9,872 billion	214.30%

To use these numbers for computations, they must be cleaned up.

Debt %>% 
  mutate( debt=gsub("[$,%]| *billion","",debt),
          percentGDP=gsub("[,%]", "", percentGDP)) %>%
  head(3)

country	debt	percentGDP
World	56308	64
United States	17607	73.60
Japan	9872	214.30
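
After gsub() the columns are still character strings. A minimal sketch of the final step, using base R only and three values typed in from the table above (the vector names are made up here): strip the symbols, then convert with as.numeric().

```r
# Values typed in from the Debt table shown above
debt <- c("$56,308 billion", "$17,607 billion", "$9,872 billion")
percentGDP <- c("64%", "73.60%", "214.30%")

# Remove "$", ",", and the word "billion" (with any space before it), then convert
debt_num <- as.numeric(gsub("[$,]| *billion", "", debt))

# Remove the percent sign, then convert
pct_num <- as.numeric(gsub("%", "", percentGDP))

print(debt_num)       # 56308 17607 9872
print(pct_num)        # 64.0 73.6 214.3
print(sum(debt_num))  # arithmetic now works
```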

Remove a currency sign

gsub("^\\$|€|¥|£|¢$","", c("$100.95", "45¢"))

## [1] "100.95" "45"

Remove leading or trailing spaces

gsub( "^ +| +$", "", "   My name is Julia     ")

## [1] "My name is Julia"
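
Both cleaning patterns above rely on the fact that | binds more loosely than ^ and $: "^ +| +$" reads as "spaces at the start, or spaces at the end". The sketch below illustrates that rule on made-up strings and, as an aside, shows that base R's trimws() gives the same result as the regex for this input.

```r
s <- "   My name is Julia     "

# Regex version from the notes, and the base R helper
by_regex  <- gsub("^ +| +$", "", s)
by_trimws <- trimws(s)

# "^a|b$" means (starts with a) OR (ends with b)
either_end <- grepl("^a|b$", c("ax", "xb", "xax"))

# Parentheses make the anchors apply to both alternatives
whole <- grepl("^(a|b)$", c("a", "b", "ab"))

print(by_regex)    # "My name is Julia"
print(either_end)  # TRUE TRUE FALSE
print(whole)       # TRUE TRUE FALSE
```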

Reading Regular Expressions

There are simple regular expressions and complicated ones. All of them look foreign until you learn how to read them.

There are many regular expression tutorials on the Internet, including interactive ones, as well as cheat sheets.

Some basics:

Very simple patterns:
- A single . means “any character.”
- A character, e.g., b, means just that character.
- Characters enclosed in square brackets, e.g., [aeiou], mean any one of those characters. (So, [aeiou] is a pattern describing a vowel.)
- The ^ inside square brackets means “any except these.” So, a consonant is [^aeiou]
Alternatives:
- A vertical bar means “either.” For example, (A|a)dam matches Adam and adam.
Sequences and repeats:
- Two simple patterns in a row mean those patterns consecutively. Example: M[aeiou] means a capital M followed by a lower-case vowel.
- A simple pattern followed by a + means “one or more times.” For example, M(ab)+ means M followed by one or more ab.
- A simple pattern followed by a ? means “zero or one time.”
- A simple pattern followed by a * means “zero or more times.”
- A simple pattern followed by {2} means “exactly two times.” Similarly, {2,5} means between two and five times, and {6,} means six times or more. For instance, [aeiou]{2} means “exactly two vowels in a row.”
Start and end of strings:
- ^ at the beginning of a regular expression means “the start of the string.”
- $ at the end means “the end of the string.”
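
The rules above can be tried one at a time with grepl(). The strings below are made up for illustration; note that without ^ and $ a pattern only has to match somewhere inside the string.

```r
# Alternatives: "madam" matches too, because it contains "adam"
alt      <- grepl("(A|a)dam", c("Adam", "adam", "madam"))
alt_full <- grepl("^(A|a)dam$", c("Adam", "adam", "madam"))

# + : one or more repeats of the group (ab)
plus <- grepl("M(ab)+", c("Mab", "Mabab", "Ma"))

# ? : zero or one "u"
opt <- grepl("^colou?r$", c("color", "colour", "colouur"))

# {2} with anchors: the whole string is exactly two vowels
two <- grepl("^[aeiou]{2}$", c("ae", "a", "aei"))

print(alt)       # TRUE TRUE TRUE
print(alt_full)  # TRUE TRUE FALSE
print(plus)      # TRUE TRUE FALSE
print(opt)       # TRUE TRUE FALSE
print(two)       # TRUE FALSE FALSE
```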

Task for you

Indicate which strings contain a match

Solution:

Note: you can experiment with grepl(pattern, x). For example:

x <- c("hi mabc","abc","abcd","abccd","abcabcdx","cab","abd","cad")
grepl("a[b?d]",x)

## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
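
Every string above matches because each one contains "ab" or "ad". Inside square brackets the ? is not a quantifier; it is a literal question mark. A short sketch on made-up strings contrasts the two readings:

```r
# Inside brackets: "a" followed by one of b, ?, d
in_brackets <- grepl("a[b?d]", c("a?", "ac", "ad"))

# Outside brackets: ? makes the preceding "b" optional
as_quantifier <- grepl("ab?d", c("ad", "abd", "abbd", "ab"))

print(in_brackets)    # TRUE FALSE TRUE
print(as_quantifier)  # TRUE TRUE FALSE FALSE
```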

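The metacharacters in regular expressions are:

. \ | ( ) [ ] { ^ $ * + ?

To match a metacharacter literally, escape it with a backslash. Inside an R string the backslash itself must be doubled, so a literal dot is written \\.

Example: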

grepl("[a\\.]", c("hello.bye", "adam"))

## [1] TRUE TRUE
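
To see why escaping matters, compare an unescaped dot with an escaped one. The sketch below uses made-up strings; fixed = TRUE is a standard grepl() argument that turns off regular-expression interpretation altogether.

```r
x <- c("abc", "a.c")

# Unescaped: . matches any character, so both strings match
any_char <- grepl(".", x)

# Escaped: only a literal dot matches
literal_dot <- grepl("\\.", x)

# fixed = TRUE treats the pattern as plain text, no escaping needed
fixed_dot <- grepl(".", x, fixed = TRUE)

print(any_char)     # TRUE TRUE
print(literal_dot)  # FALSE TRUE
print(fixed_dot)    # FALSE TRUE
```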

There are also built-in character sets for commonly used collections.

Example:

x <- c("a0b", "a1b", "a12b")
grepl("a[[:digit:]]b",x)

## [1]  TRUE  TRUE FALSE

# c() coerces the number 3 to the string "3"
x <- c(3, "a", "b", "?")
grepl("[[:digit:]abc]",x)

## [1]  TRUE  TRUE  TRUE FALSE
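
[[:digit:]] is one of several POSIX character classes that R supports; others include [[:alpha:]], [[:space:]], and [[:punct:]]. A few hedged illustrations on made-up strings:

```r
# Strip punctuation
no_punct <- gsub("[[:punct:]]", "", "Hello, world!")

# Collapse any run of whitespace (spaces, tabs) into a single space
one_space <- gsub("[[:space:]]+", " ", "a  b\t c")

# Does the string consist only of letters?
only_letters <- grepl("^[[:alpha:]]+$", c("abc", "ab1"))

print(no_punct)      # "Hello world"
print(one_space)     # "a b c"
print(only_letters)  # TRUE FALSE
```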

Task for you

Write a regular expression that matches

only the strings “cat”,“at”,“t”

x <- c("cat","at","t","ta","ct")
grepl("^(ca|a)?t$",x)

## [1]  TRUE  TRUE  TRUE FALSE FALSE

# Doesn't work: c? and a? are each optional on their own, so "ct" also matches
x <- c("cat","at","t","ta","ct")
grepl("^c?a?t$",x)

## [1]  TRUE  TRUE  TRUE FALSE  TRUE

the strings “cat”, “caat”, “caaat”, and so on.

x <- c("cat","caat","caaat")
grepl("^ca+t$",x)

## [1] TRUE TRUE TRUE

“dog”, “Dog”,“dOg”,“doG”,“DOg”, etc. (i.e., the word dog in any combination of upper and lower case)

x <- c("dog", "Dog" , "dOg", "doG" , "DOg")
grepl("(d|D)(o|O)(g|G)",x)

## [1] TRUE TRUE TRUE TRUE TRUE
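
Two alternative ways to write the case-insensitive match, shown as a sketch rather than as the expected solution: character classes, or grepl()'s ignore.case argument. The extra test strings "dogma" and "cat" are added here to show what the ^ and $ anchors change.

```r
x <- c("dog", "Dog", "dOg", "doG", "DOg", "DOG", "dogma", "cat")

# Unanchored: "dogma" matches because it contains "dog"
contains_dog <- grepl("dog", x, ignore.case = TRUE)

# Anchored: the whole string must be "dog" in some mix of cases
is_dog <- grepl("^dog$", x, ignore.case = TRUE)

# Same result with explicit character classes
is_dog_classes <- grepl("^[dD][oO][gG]$", x)

print(contains_dog)    # TRUE x7, then FALSE
print(is_dog)          # TRUE x6, then FALSE FALSE
print(is_dog_classes)  # same as is_dog
```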


For next time, finish reading chapter 16 in the book.