The purpose of this assignment was to practice using regular expressions in R. These questions are from the end of Chapter 8 in the textbook Automated Data Collection with R.
first_name
last_name
.(i) To start, I loaded the raw data and extracted the names.
namesraw <- c("555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert")
names <- unlist(str_extract_all(namesraw, "[[:alpha:]., ]{2,}"))
names
## [1] "Moe Szyslak" "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders" "Simpson, Homer" "Dr. Julius Hibbert"
(ii) Then, I removed the titles from the names, and trimmed the white space left over. I identified the titles by looking for a pattern of 2 to 3 characters plus a period.
trimmed <- str_replace_all(names, "[[:alpha:]]{2,3}\\.", "")
trimmed <- str_trim(trimmed)
trimmed
## [1] "Moe Szyslak" "Burns, C. Montgomery" "Timothy Lovejoy"
## [4] "Ned Flanders" "Simpson, Homer" "Julius Hibbert"
(iii) To fix the names, my goal was to create clean lists of first names and last names that I could easily concatenate back together in the right order. I started by creating a clean list of last names. I identified last names as:
- the pattern of all printable characters before a comma; OR
- the specific pattern of 1 character + 1 blank space + any number of characters.
Then, I removed the 1 character + 1 blank space pattern, and also removed the commas. This left me with a clean list of last names.
last_names <- str_extract_all(trimmed,"[[:print:]]+[\\,]|[[:alpha:]][[:blank:]]{1}[[:alpha:]]+")
last_names <- str_replace_all(last_names,"[[:alpha:]]{1}[[:blank:]]{1}", "")
last_names <- str_replace_all(last_names, "\\,","")
last_names
## [1] "Szyslak" "Burns" "Lovejoy" "Flanders" "Simpson" "Hibbert"
(iv) The clean list of first names was easier. I identified first names as:
- the pattern of all printable characters after a comma, OR
- the pattern of characters + blank spaces.
Then, I removed the commas and trimmed the blank spaces, and got a clean list of first names.
first_names <- str_extract_all(trimmed,"[\\,][[:print:]]+|[[:alpha:]]+[[:blank:]]+")
first_names <- str_replace_all(first_names,"[\\,]+[[:blank:]]{1,2}", "")
first_names <- str_trim(first_names)
first_names
## [1] "Moe" "C. Montgomery" "Timothy" "Ned"
## [5] "Homer" "Julius"
(v) Finally, I concatenated both lists in the right order and separated the names with a space.
final <- str_c(first_names, last_names, sep=" ")
final
## [1] "Moe Szyslak" "C. Montgomery Burns" "Timothy Lovejoy"
## [4] "Ned Flanders" "Homer Simpson" "Julius Hibbert"
A title is a pattern of two or three characters followed by a period. Two of the names fit this pattern.
title <- grepl("[[:alpha:]]{2,3}\\.", names)
title
## [1] FALSE FALSE TRUE FALSE FALSE TRUE
I used my cleaned list of first and last names to answer this question, since it only contains names (without titles). I identified a second name as characters followed by a period. Only one name fit this pattern.
secondname <- grepl("[[:alpha:]]+\\.", final)
secondname
## [1] FALSE TRUE FALSE FALSE FALSE FALSE
[0-9]+\\$
This pattern matches a string that has a number from 0 to 9 (at least once) followed by a dollar sign.
Example: “4$”
a1 <- c("4$", "4", "10$")
a2 <- grepl("[0-9]+\\$", a1)
a2
## [1] TRUE FALSE TRUE
\\b[a-z]{1,4}\\b
This pattern describes a string that has 1-4 letters from edge to edge.
Example: “dddd”
b1 <- c("hodc", "ff", "dddd", "5gs", "ggggg")
b2 <- grepl("\\b[a-z]{1,4}\\b", b1)
b2
## [1] TRUE TRUE TRUE FALSE FALSE
.*?\\.txt$
This pattern describes a string of any length (found at least once) that ends in “.txt”.
Example: “assignment.txt”
c1 <- c("assignment.txt", "m.txt", ".txtm", "r.txt.y.txt")
c2 <- grepl(".*?\\.txt$", c1)
c2
## [1] TRUE TRUE FALSE TRUE
\\d{2}/\\d{2}/\\d{4}
This pattern describes a string with 2 digits, a forward slash, two digits, another forward slash, and 4 digits. It is constructed like a date.
Example: “11/22/2018”
d1 <- c("11222018", "11/22/2018")
d2 <- grepl("\\d{2}/\\d{2}/\\d{4}", d1)
d2
## [1] FALSE TRUE
<(.+?)>.+?</\\1>
This pattern matches an HTML tag with at least one character between the tags. It also has a backreference
\1
.Example: “
<b>bold</b>
”
e1 <- c("<b>bold</b>", "<i>italics</i>")
e2 <- grepl("<(.+?)>.+?</\\1>", e1)
e2
## [1] TRUE TRUE
secret <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"
secret <- str_replace_all(secret, "[[:digit:]]", "")
secret <- str_replace_all(secret, "[[:lower:]]", "")
secret <- str_replace_all(secret, "\\.", " ")
secret
## [1] "CONGRATULATIONS YOU ARE A SUPERNERD!"