Source file ⇒ lec22.Rmd
Here again is a useful cheat sheet
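Write a regular expression that matches each of the following: a number with an optional sign and an optional decimal part, and a string made of exactly two words.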
x <- c(1, +12.4,-56.899, 0, -23, .000, -0, "4.5.3")
grepl("^[-+]?[0-9]+\\.?[0-9]*$",x)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
# or
grepl("^[-+]?[[:digit:]]+\\.?[[:digit:]]*$",x)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
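One caveat about the test vector above: it mixes unquoted numbers with one string, so R parses the numbers first and only then coerces them to character. As a result `+12.4` is stored as "12.4" and `.000` as "0", and the pattern never sees the plus sign or the leading dot. The sketch below, which assumes the goal is to test the literal spellings, quotes every element. The names `x_chr`, `orig_re` and `num_re`, and the extended pattern itself, are illustrative additions rather than part of the original exercise.

```r
# Quote every element so the regex sees the literal text, signs and dots included.
x_chr <- c("1", "+12.4", "-56.899", "0", "-23", ".000", "-0", "4.5.3")

# The pattern used above requires at least one digit before the decimal point,
# so it rejects the literal string ".000".
orig_re <- "^[-+]?[0-9]+\\.?[0-9]*$"
orig_hits <- grepl(orig_re, x_chr)

# Hypothetical extension: also accept a leading decimal point followed by digits.
num_re <- "^[-+]?([0-9]+\\.?[0-9]*|\\.[0-9]+)$"
ext_hits <- grepl(num_re, x_chr)

print(data.frame(x_chr, orig_hits, ext_hits))
```

With quoted input the two patterns differ only on ".000"; both still reject "4.5.3".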
x <- c(" stats rocks", "I love stats")
grepl("^[[:blank:]]*[[:alpha:]]+[[:blank:]]+[[:alpha:]]+[[:blank:]]*$",x)
## [1] TRUE FALSE
grepl(), gsub(), extractMatches()
Regular expressions are used for several purposes:
To detect whether a string contains a pattern, use filter() and grepl().
Details:
grepl() searches for matches to regexp within each element of a character vector and returns a logical vector
syntax: grepl(regex,vec)
example: Here is a list of names from the Bible:
BibleNames <- read.file("http://tiny.cc/dcf/BibleNames.csv")
head(BibleNames)
## name meaning
## 1 Aaron a teacher; lofty; mountain of strength
## 2 Abaddon the destroyer
## 3 Abagtha father of the wine-press
## 4 Abana made of stone; a building
## 5 Abarim passages; passengers
## 6 Abba father
Which names contain any of the substrings “bar”, “dam”, or “lory”?
BibleNames %>%
filter(grepl("(bar)|(dam)|(lory)", name)) %>%
head()
## name meaning
## 1 Abarim passages; passengers
## 2 Aceldama field of blood
## 3 Adam earthy; red
## 4 Adamah red earth; of blood
## 5 Adami my man; red; earthy; human
## 6 Bethabara the house of confidence
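To see what grepl() returns on its own, outside of filter(), here is a minimal sketch on a small made-up vector. The vector `nms` is a hypothetical stand-in for the name column so that the example runs without downloading the data.

```r
# Hypothetical stand-in for the name column, so the example runs offline.
nms <- c("Abarim", "Adam", "Glory", "Ruth", "Barak")

# grepl() returns one TRUE/FALSE per element; matching is case-sensitive by default.
hits <- grepl("bar|dam|lory", nms)
print(nms[hits])

# ignore.case = TRUE also catches the capitalised "Bar" in "Barak".
print(nms[grepl("bar|dam|lory", nms, ignore.case = TRUE)])
```

Because the result is a logical vector, it can be used directly for subsetting or passed to filter() as in the pipeline above.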
To replace the parts of a string that match a pattern with something else, use mutate() and gsub().
Details:
gsub() searches for matches to the regexp within each element of a character vector and replaces every match
syntax: gsub(regex, replacement, vec)
Clean up the Debt data table below so that the debt and percentGDP columns contain only digits and decimal points:
head(Debt)
## country debt percentGDP
## 1 World $56,308 billion 64%
## 2 United States $17,607 billion 73.60%
## 3 Japan $9,872 billion 214.30%
Debt %>%
  mutate(debt = gsub("[$,]| billion", "", debt),   # also remove the space before "billion"
         percentGDP = gsub("%", "", percentGDP))
## country debt percentGDP
## 1 World 56308 64
## 2 United States 17607 73.60
## 3 Japan 9872 214.30
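The cleaned columns above are still character strings, because gsub() always returns text. A minimal sketch of the final conversion step follows, written in base R so it runs without extra packages. The data frame is rebuilt by hand from the three rows shown, which is an assumption about how the columns are stored.

```r
# Rebuild the three rows shown above as a small data frame (strings, not factors).
Debt <- data.frame(
  country    = c("World", "United States", "Japan"),
  debt       = c("$56,308 billion", "$17,607 billion", "$9,872 billion"),
  percentGDP = c("64%", "73.60%", "214.30%"),
  stringsAsFactors = FALSE
)

# gsub() returns character strings, so convert explicitly after stripping symbols.
Debt$debt       <- as.numeric(gsub("[$,]| billion", "", Debt$debt))
Debt$percentGDP <- as.numeric(gsub("%", "", Debt$percentGDP))

str(Debt)
```

After the conversion the two columns can be used in arithmetic, for example to sort countries by debt.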
To pull out the part of a string that matches a marked region of a regexp, use extractMatches() from the DataComputing package.
Details:
input is a data table and output is a data table with one or more additional columns with the extractions
syntax is df %>% extractMatches(regex, var,...)
wrap part of the regexp in parentheses to signal that the matching content is to be extracted as a string
when there is no match, extractMatches() returns NA
example:
data.frame(string=c("hi there there","bye")) %>% extractMatches("r(e)",string)
## string match1
## 1 hi there there e
## 2 bye <NA>
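For readers without the DataComputing package, the same idea can be sketched in base R with regexec() and regmatches(). This is an illustration of group extraction, not the way extractMatches() is implemented; the variable names `strings`, `m` and `match1` are chosen here to mirror the output above.

```r
strings <- c("hi there there", "bye")

# regexec() records the first match and its parenthesised groups;
# regmatches() returns them, with character(0) where nothing matched.
m <- regmatches(strings, regexec("r(e)", strings))

# Element 1 of each result is the whole match, element 2 is the first group.
match1 <- vapply(
  m,
  function(g) if (length(g) >= 2) g[2] else NA_character_,
  character(1)
)

print(data.frame(string = strings, match1 = match1))
```

As in the extractMatches() example, only the first occurrence of the pattern in each string is used, and a string with no match yields NA.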
What are the most common end vowels for Bible names?
To answer this question, you have to extract the final letter of each name when that letter is a vowel. The extractMatches() transformation function can do this.
BibleNames %>%
extractMatches( "([aeiou])$", name, vowel=1) %>%
group_by(vowel ) %>%
summarise( total= n()) %>%
arrange( vowel, desc(total) )
## Source: local data frame [6 x 2]
##
## vowel total
## (fctr) (int)
## 1 a 233
## 2 e 55
## 3 i 207
## 4 o 24
## 5 u 18
## 6 NA 2089
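The table above is ordered alphabetically by vowel. To read off the most common ending directly, the counts can instead be sorted by frequency. The sketch below does this in base R on a handful of names taken from the listings in these notes; the vector `nms` is only a small sample, so its counts are not the counts for the full table.

```r
# A small sample of names that appear in the listings in these notes.
nms <- c("Abba", "Abdi", "Abana", "Aaron", "Abednego", "Abda")

# Keep the final letter when it is a vowel, otherwise NA.
last <- ifelse(grepl("[aeiou]$", nms),
               sub(".*([aeiou])$", "\\1", nms),
               NA_character_)

# Tabulate (keeping NA as its own category) and sort by decreasing count.
counts <- sort(table(last, useNA = "ifany"), decreasing = TRUE)
print(counts)
```

In the dplyr pipeline the equivalent change would be to arrange by desc(total) alone.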
Which Bible names start and end with a vowel?
BibleNames %>%
extractMatches( "^([AEIOU]).*([aeiou])$", name, beg_vowel=1, end_vowel=2) %>%
filter(!is.na(beg_vowel) & !is.na(end_vowel)) %>%
select(-meaning) %>%
head()
## name beg_vowel end_vowel
## 1 Abagtha A a
## 2 Abana A a
## 3 Abba A a
## 4 Abda A a
## 5 Abdi A i
## 6 Abednego A o
Which names in BibleNames have the word “man” in the meaning (not as the first or last word)?
BibleNames %>%
filter(grepl("[[:blank:]]+man[[:blank:]]+",meaning))
## name meaning
## 1 Andronicus a man excelling others
## 2 Elkoshite a man of Elkeshai
## 3 Iscariot a man of murder; a hireling
## 4 Ishbosheth a man of shame
## 5 Lebbeus a man of heart; praising; confessing
Extract the root (everything but the final letter) of the names in BibleNames that end with a vowel or y (example: grab Ev from Eva).
BibleNames %>%
extractMatches(pattern="(.*)[aeiouy]$", name, root=1) %>%
filter(!is.na(root)) %>%
select(-meaning) %>%
head()
## name root
## 1 Abagtha Abagth
## 2 Abana Aban
## 3 Abba Abb
## 4 Abda Abd
## 5 Abdi Abd
## 6 Abednego Abedneg
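The same root can be obtained by replacement rather than extraction: delete the final vowel with sub() and leave NA where the name does not end in one. The following is a minimal base R sketch on a few names; `nms` and `root` are illustrative, and "Mary" is included only to show the effect of the y in the character class.

```r
# Three names from the listings above plus one that ends in "y".
nms <- c("Abagtha", "Abdi", "Abednego", "Aaron", "Mary")

# Drop the final letter when it is in [aeiouy]; otherwise return NA.
root <- ifelse(grepl("[aeiouy]$", nms),
               sub("[aeiouy]$", "", nms),
               NA_character_)

print(data.frame(name = nms, root = root))
```

sub() replaces only the first match, which is enough here because the pattern is anchored to the end of the string.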