Regular Expressions in R

This document illustrates how to work with regular expressions in R.

grep

homicide <- readLines("homicides.txt")
homicide[1]
## [1] "39.311024, -76.674227, iconHomicideShooting, 'p2', '<dl><dt>Leon Nelson</dt><dd class=\"address\">3400 Clifton Ave.<br />Baltimore, MD 21216</dd><dd>black male, 17 years old</dd><dd>Found on January 1, 2007</dd><dd>Victim died at Shock Trauma</dd><dd>Cause: shooting</dd></dl>'"
length(grep("iconHomicideShooting", homicide))
## [1] 228
length(grep("iconHomicideShooting|icon_homicide_shooting", homicide))
## [1] 1003
i <- grep("[Cc]ause: [Ss]hooting", homicide, value = TRUE)
i[1:3]
## [1] "39.311024, -76.674227, iconHomicideShooting, 'p2', '<dl><dt>Leon Nelson</dt><dd class=\"address\">3400 Clifton Ave.<br />Baltimore, MD 21216</dd><dd>black male, 17 years old</dd><dd>Found on January 1, 2007</dd><dd>Victim died at Shock Trauma</dd><dd>Cause: shooting</dd></dl>'"
## [2] "39.312641, -76.698948, iconHomicideShooting, 'p3', '<dl><dt>Eddie Golf</dt><dd class=\"address\">4900 Challedon Road<br />Baltimore, MD 21207</dd><dd>black male, 26 years old</dd><dd>Found on January 2, 2007</dd><dd>Victim died at scene</dd><dd>Cause: shooting</dd></dl>'"      
## [3] "39.352676, -76.607979, iconHomicideShooting, 'p7', '<dl><dt>Michael Cunningham</dt><dd class=\"address\">5200 Ready Ave.<br />Baltimore, MD 21212</dd><dd>black male, 46 years old</dd><dd>Found on January 5, 2007</dd><dd>Victim died at JHH</dd><dd>Cause: shooting</dd></dl>'"

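grep returns the indices of the elements that match (or, with value = TRUE, the whole matching elements). It does not return the substring that actually matched.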
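
To make the distinction concrete, here is a minimal sketch on a small made-up vector (the strings in `x` are hypothetical and not taken from homicides.txt). It compares the default result of grep, the result with value = TRUE, and the logical vector returned by grepl. In none of the three cases is the matched substring itself returned.

```r
# Hypothetical sample data, for illustration only
x <- c("Cause: shooting", "Cause: stabbing", "cause: Shooting")

idx   <- grep("[Cc]ause: [Ss]hooting", x)                # indices of matching elements
whole <- grep("[Cc]ause: [Ss]hooting", x, value = TRUE)  # the full matching elements
hit   <- grepl("[Cc]ause: [Ss]hooting", x)               # one TRUE/FALSE per element

print(idx)
print(whole)
print(hit)
```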

regexpr and gregexpr

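Search a character vector for regular expression matches and return the character position where each match begins, together with its length; useful in conjunction with regmatches. regexpr reports only the first match in each element, while gregexpr reports all of them.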

regexpr("<dd>[Ff]ound(.*?)</dd>", homicide[1:10])
##  [1] 177 178 188 189 178 182 178 187 182 183
## attr(,"match.length")
##  [1] 33 33 33 33 33 33 33 33 33 33
## attr(,"useBytes")
## [1] TRUE
m1 <- substr(homicide[1], 177, 177 + 33 - 1)
m1
## [1] "<dd>Found on January 1, 2007</dd>"
r <- regexpr("<dd>[Ff]ound(.*?)</dd>", homicide[1:5])
m15 <- regmatches(homicide[1:5], r)
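
The following sketch shows how the pieces fit together on two made-up strings (the vector `x` and its pattern are assumptions for illustration, not part of the homicide data). It reads the start position and match length from the regexpr result instead of hard-coding them, checks that regmatches extracts the same text, and contrasts regexpr with gregexpr.

```r
# Hypothetical sample strings, for illustration only
x <- c("<dd>Found on January 1, 2007</dd><dd>Cause: shooting</dd>",
       "<dd>no date here</dd>")

# regexpr: start of the first match in each element (-1 means no match)
first <- regexpr("<dd>[A-Z].*?</dd>", x)

# Extract the first match by hand from its start position and length
start  <- first[1]
len    <- attr(first, "match.length")[1]
manual <- substr(x[1], start, start + len - 1)

# regmatches does the same extraction; elements without a match are dropped
first_txt <- regmatches(x, first)

# gregexpr: all matches, returned as a list with one entry per element
all_txt <- regmatches(x, gregexpr("<dd>[A-Z].*?</dd>", x))

print(manual)
print(first_txt)
print(all_txt)
```

Because regmatches drops elements that have no match when given a regexpr result, its output can be shorter than the input vector.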

sub and gsub

Search a character vector for matches to a regular expression and replace them with another string. sub replaces only the first match in each element; gsub replaces every match.

sub("<dd>[Ff]ound on |</dd>", "", m1)
## [1] "January 1, 2007</dd>"
gsub("<dd>[Ff]ound on |</dd>", "", m1)
## [1] "January 1, 2007"
gsub("<dd>[Ff]ound on |</dd>", "", m15)
## [1] "January 1, 2007" "January 2, 2007" "January 2, 2007" "January 3, 2007"
## [5] "January 5, 2007"
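
As a small self-contained sketch of the same idea, the code below runs sub and gsub on one made-up string (`s` is hypothetical). It also shows an alternative not used in this document: a backreference (`\\1`) in the replacement, which keeps only the text captured by the parentheses.

```r
# Hypothetical input shaped like the matches above
s <- "<dd>Found on January 1, 2007</dd>"

first_only <- sub("<dd>|</dd>", "", s)    # removes only the first match, the opening tag
all_gone   <- gsub("<dd>|</dd>", "", s)   # removes every match, both tags

# Alternative: match the whole string and replace it with capture group 1
date_only <- sub("<dd>[Ff]ound on (.*?)</dd>", "\\1", s)

print(first_only)
print(all_gone)
print(date_only)
```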

regexec

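Works like regexpr, but also returns the start positions and lengths of parenthesized sub-expressions (capture groups), so the captured text can be extracted directly.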

r <- regexec("<dd>[Ff]ound on (.*?)</dd>", homicide)
substr(homicide[1], 177, 177 + 33 - 1)
## [1] "<dd>Found on January 1, 2007</dd>"
substr(homicide[1], 190, 190 + 15 - 1)
## [1] "January 1, 2007"
mm <- regmatches(homicide, r)
dates <- sapply(mm, function(x) x[2])
dates <- as.Date(dates, "%B %d, %Y")
hist(dates, "month", freq = TRUE)
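
The workflow above can be tried without the data file. The sketch below uses two made-up strings (`x` is hypothetical) and shows what regexec returns for each element: the start of the full match followed by the start of the capture group. The date conversion assumes month names are parsed in an English locale.

```r
# Hypothetical sample strings, for illustration only
x <- c("<dd>Found on January 1, 2007</dd>",
       "<dd>found on March 15, 2008</dd>")

r <- regexec("<dd>[Ff]ound on (.*?)</dd>", x)
print(r[[1]])               # start positions: full match first, then capture group 1

mm <- regmatches(x, r)      # list; each entry is c(full match, group 1)
dates_chr <- vapply(mm, function(m) m[2], character(1))
print(dates_chr)

# "%B" reads full month names in the current locale (English is assumed here)
dates <- as.Date(dates_chr, "%B %d, %Y")
print(dates)
```

Using vapply with character(1) rather than sapply is a design choice: it guarantees a character vector even when the input is empty, and an element with no match yields NA.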

[Figure: histogram of homicide dates by month]