Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset…provide code that identifies the majors that contain either “DATA” or “STATISTICS”…
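library(rvest)    # read_html(), html_nodes(), html_table()
library(dplyr)    # %>% and union_all()
library(stringr)  # str_match_all(), used in the regex exercises further down
url <- "https://projects.fivethirtyeight.com/mid-levels/college-majors/index.html?v=3"
majors <- read_html(url)
majors_table <- html_nodes(majors, "table") %>%
html_table(fill=TRUE) %>%
.[[1]]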
data_majors <- subset(majors_table,
majors_table$MAJOR %in%
grep(pattern = (".*DATA.*"),
majors_table$MAJOR,
value = TRUE,
ignore.case = TRUE)
)
stats_majors <- subset(majors_table,
majors_table$MAJOR %in%
grep(pattern = (".*STATISTICS.*"),
majors_table$MAJOR,
value = TRUE,
ignore.case = TRUE)
)
print(union_all(data_majors$MAJOR, stats_majors$MAJOR))
## [1] "Computer Programming & Data Processing"
## [2] "Mgmt. Information Systems & Statistics"
## [3] "Statistics & Decision Science"
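The two grep() calls above can also be collapsed into a single pattern with alternation. Here is a minimal sketch in base R, assuming a small hypothetical vector `major_names` that stands in for `majors_table$MAJOR` (only the three matching names come from the output above; the other two entries are made up for contrast):

```r
# Hypothetical stand-in for majors_table$MAJOR, so no scraping is needed here.
major_names <- c(
  "Computer Programming & Data Processing",
  "Mgmt. Information Systems & Statistics",
  "Statistics & Decision Science",
  "Example Major A",   # made-up non-matching entry
  "Example Major B"    # made-up non-matching entry
)

# "DATA|STATISTICS" matches either word; ignore.case handles mixed case.
is_match <- grepl("DATA|STATISTICS", major_names, ignore.case = TRUE)
matched_majors <- major_names[is_match]
print(matched_majors)
```

Because each name is tested only once, a major containing both words would be listed a single time, whereas union_all() keeps duplicates.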
fruits <- c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime" , "lychee", "mulberry", "olive", "salal berry")
writeLines(fruits)
## bell pepper
## bilberry
## blackberry
## blood orange
## blueberry
## cantaloupe
## chili pepper
## cloudberry
## elderberry
## lime
## lychee
## mulberry
## olive
## salal berry
What are the perfect strings with which to test regex grouping and backreferences? Why, the delightfully repetitive lyrics of Lady Gaga’s “Bad Romance”, of course!
(.)\1\1 - this will match any single character that appears three times in a row (the character itself, then two immediate repeats):
test_string1 <- "Oh-oh-oh-oooh, oh-oh-oh / Caught in a bad romance"
str_match_all(test_string1, "(.)\\1\\1")
## [[1]]
## [,1] [,2]
## [1,] "ooo" "o"
## source: Lady Gaga, "Bad Romance" LyricFind
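As a quick check of how the backreference behaves, this sketch compares (.)\1\1 with the shorthand (.)\1{2}. It uses base R's regmatches() and gregexpr() with perl = TRUE instead of stringr so that it runs without extra packages; the variable names `triple_a` and `triple_b` are illustrative.

```r
test_string1 <- "Oh-oh-oh-oooh, oh-oh-oh / Caught in a bad romance"

# \\1 refers back to whatever the first group captured,
# so the pattern needs the same character three times in a row.
triple_a <- regmatches(test_string1,
                       gregexpr("(.)\\1\\1", test_string1, perl = TRUE))[[1]]

# The same idea written with a quantifier on the backreference.
triple_b <- regmatches(test_string1,
                       gregexpr("(.)\\1{2}", test_string1, perl = TRUE))[[1]]

print(triple_a)
print(identical(triple_a, triple_b))
```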
(.)(.)\2\1 - this will match any sequence of two characters that then repeats in reverse order:
test_string2 <- "I don't wanna be friends, want your bad romance"
str_match_all(test_string2, "(.)(.)\\2\\1")
## [[1]]
## [,1] [,2] [,3]
## [1,] "anna" "a" "n"
## source: Lady Gaga, "Bad Romance" LyricFind
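To see which shapes (.)(.)\2\1 accepts and rejects, here is a small base R sketch on a few test words. The words are made up for illustration, apart from "wanna", and grepl() with perl = TRUE stands in for the stringr call.

```r
# Illustrative test words; only "wanna" echoes the lyric.
words <- c("wanna", "abba", "noon", "papa")

# Groups 1 and 2 capture two characters; \\2\\1 demands them again, reversed.
has_mirror <- grepl("(.)(.)\\2\\1", words, perl = TRUE)
names(has_mirror) <- words
print(has_mirror)
```

"papa" fails because its second half repeats the first half in the same order rather than mirrored.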
(..)\1 - this will match any sequence of two characters that repeats immediately. How about “Gaga,” for example, in test_string3…
test_string3 <- "Rah, rah-ah-ah-ah/ Roma, roma-ma/ Gaga, ooh-la-la/ Want your bad romance"
str_match_all(test_string3, "(..)\\1")
## [[1]]
## [,1] [,2]
## source: Lady Gaga, "Bad Romance" LyricFind
…no dice! "Ga" and "ga" differ in case, so we need the inline modifier (?i) to make the match case-insensitive.
str_match_all(test_string3, "(?i)(..)\\1")
## [[1]]
## [,1] [,2]
## [1,] "Gaga" "Ga"
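The effect of (?i) on a backreference can be isolated on the single word "Gaga". This is a sketch in base R with perl = TRUE rather than stringr; it also shows grepl()'s ignore.case argument as an alternative to the inline modifier.

```r
word <- "Gaga"

# Case-sensitive: "Ga" and "ga" differ, so the backreference fails.
sensitive <- grepl("(..)\\1", word, perl = TRUE)

# Inline (?i) makes the whole pattern caseless, backreference included.
inline_flag <- grepl("(?i)(..)\\1", word, perl = TRUE)

# The same effect through grepl()'s own argument.
arg_flag <- grepl("(..)\\1", word, ignore.case = TRUE, perl = TRUE)

print(c(sensitive = sensitive, inline_flag = inline_flag, arg_flag = arg_flag))
```

In stringr, wrapping the pattern in regex(..., ignore_case = TRUE) should give the same behavior without editing the pattern itself.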
(.).\1.\1 - this will match any sequence of five characters in which the first, third and fifth are the same. We can almost get this from the two oh-oh-ohs in test_string1, if it weren’t for those pesky hyphens and the final h…
str_match_all(test_string1, "(.).\\1.\\1")
## [[1]]
## [,1] [,2]
…so, we’ll just add the hyphens in as literals along with a final dot and word boundary. We’ll also make the regex case-insensitive in order to capture the first, capitalized Oh-oh-oh…
str_match_all(test_string1, "(?i)(.).-\\1.-\\1.\\b")
## [[1]]
## [,1] [,2]
## [1,] "Oh-oh-oh" "O"
## [2,] "oh-oh-oh" "o"
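For comparison, the unmodified five-character pattern (.).\1.\1 does succeed on strings where the repeated character really sits in positions 1, 3 and 5. A small base R sketch with made-up inputs:

```r
# Made-up inputs: the first two repeat a character at positions 1, 3 and 5.
samples <- c("ohoho", "abaca", "oh-oh-oh")

fits <- grepl("(.).\\1.\\1", samples, perl = TRUE)
names(fits) <- samples
print(fits)
```

"oh-oh-oh" fails because its "o" recurs every three characters, while the pattern expects the repeat every two.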
(.)(.)(.).*\3\2\1 - here, we’re looking for any 3 characters, followed by a substring of zero or more characters, followed by the first 3 characters in reverse order.
For this regex, we’ll need to turn to a more recent entry in the Gaga canon - I’m talking of course about her 2018 duet with Bradley Cooper from their blockbuster remake of A Star is Born:
test_string4 <- "In the sha-hal, sha-hal-low/ In the shallow, sha-la-la-la-low"
str_match_all(test_string4, "(.)(.)(.).*\\3\\2\\1")
## [[1]]
## [,1] [,2] [,3] [,4]
## [1,] "al-low/ In the shallow, sha-la-la-la" "a" "l" "-"
## Note: LyricFind gives this lyric as "In the shallow, shallow/ In the shallow, shallow", but anyone who has belted along with Bradley and Gaga knows better than that!
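The match above is long because .* is greedy: it stretches to the last place where the reversed triple appears. A sketch of the contrast with the lazy form .*?, using base R's regexpr() (first match only) with perl = TRUE instead of stringr:

```r
test_string4 <- "In the sha-hal, sha-hal-low/ In the shallow, sha-la-la-la-low"

# Greedy .* runs to the LAST occurrence of the reversed triple.
greedy <- regmatches(test_string4,
                     regexpr("(.)(.)(.).*\\3\\2\\1", test_string4, perl = TRUE))

# Lazy .*? stops at the FIRST occurrence of the reversed triple.
lazy <- regmatches(test_string4,
                   regexpr("(.)(.)(.).*?\\3\\2\\1", test_string4, perl = TRUE))

print(greedy)
print(lazy)
```

Both versions start at the same "al-"; only the end point of the match changes.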
Construct regular expressions to match words that…
test_string5 <- "Out in the club, and I'm sippin' that bub / And you're not gonna reach my telephone"
str_match_all(test_string5, "(?i)\\b([a-z])\\S*\\1\\b")
## [[1]]
## [,1] [,2]
## [1,] "that" "t"
## [2,] "bub" "b"
## source: Lady Gaga, "Telephone (ft. Beyonce)" LyricFind
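An alternative to word boundaries is to split the lyric into words first and then anchor the pattern with ^ and $. This base R sketch assumes a "word" is any run of letters and apostrophes, which is a simplification:

```r
test_string5 <- "Out in the club, and I'm sippin' that bub / And you're not gonna reach my telephone"

# Split on anything that is not a letter or an apostrophe (simplifying assumption).
words <- strsplit(test_string5, "[^A-Za-z']+")[[1]]

# ^ and $ pin the captured letter to the very start and very end of each word.
same_ends <- words[grepl("^([a-z]).*\\1$", words, ignore.case = TRUE, perl = TRUE)]
print(same_ends)
```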
test_string6 <- "I'm your biggest fan/ I'll follow you until you love me/ Papa-paparazzi"
str_match_all(test_string6, "(?i)\\b\\S*(.)(.)\\S*\\1\\2\\S*\\b")
## [[1]]
## [,1] [,2] [,3]
## [1,] "Papa-paparazzi" "p" "a"
## source: Lady Gaga, "Paparazzi" LyricFind
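The two single-character groups (.)(.) followed by \1\2 can also be written as one two-character group (..) followed by \1. A base R sketch on a few test words (made up for illustration, apart from "paparazzi"):

```r
# Illustrative test words; only "paparazzi" echoes the lyric.
words <- c("paparazzi", "church", "banana", "follow")

# One two-character group, followed later by the same two characters.
has_repeated_pair <- grepl("(..).*\\1", words, perl = TRUE)
names(has_repeated_pair) <- words
print(has_repeated_pair)
```

"follow" has a doubled letter but no pair of characters that occurs twice, so it does not match.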
test_string7 <- "Don't be a drag, just be a queen / Whether you're broke or evergreen"
str_match_all(test_string7, "(?i)\\b\\S*([a-z])\\S*\\1\\S*\\1\\S*\\b")
## [[1]]
## [,1] [,2]
## [1,] "evergreen" "e"
## source: Lady Gaga, "Born This Way", Musixmatch
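The "at least three places" condition can be written more compactly by putting a quantifier on a non-capturing group. This is a base R sketch with perl = TRUE, applied to single words taken from the lyrics above:

```r
# Single words taken from the lyrics quoted earlier.
words <- c("evergreen", "queen", "Whether", "telephone")

# Capture one letter, then require it twice more anywhere later in the word.
three_or_more <- grepl("([a-z])(?:.*\\1){2}", words, ignore.case = TRUE, perl = TRUE)
names(three_or_more) <- words
print(three_or_more)
```

Changing {2} to {n - 1} would generalize the check to a letter appearing in at least n places.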