Question 1: Subsetting a dataframe using regular expressions

Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset…provide code that identifies the majors that contain either “DATA” or “STATISTICS”…

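library(rvest)    # read_html(), html_nodes(), html_table()
library(dplyr)    # %>% and union_all()
library(stringr)  # str_match_all(), used in Questions 3 and 4

url <- "https://projects.fivethirtyeight.com/mid-levels/college-majors/index.html?v=3"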

majors <- read_html(url)

majors_table <- html_nodes(majors, "table") %>%
  html_table(fill=TRUE) %>%
  .[[1]]

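# Keep rows whose MAJOR contains "DATA" (case-insensitive).
# grep() matches substrings, so no leading or trailing ".*" is needed.
data_majors <- subset(majors_table,
                      majors_table$MAJOR %in%
                        grep(pattern = "DATA",
                             majors_table$MAJOR,
                             value = TRUE,
                             ignore.case = TRUE)
                      )

# Keep rows whose MAJOR contains "STATISTICS" (case-insensitive).
stats_majors <- subset(majors_table,
                       majors_table$MAJOR %in%
                         grep(pattern = "STATISTICS",
                              majors_table$MAJOR,
                              value = TRUE,
                              ignore.case = TRUE)
                       )

# Combine the two sets of major names
print(union_all(data_majors$MAJOR, stats_majors$MAJOR))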
## [1] "Computer Programming & Data Processing"
## [2] "Mgmt. Information Systems & Statistics"
## [3] "Statistics & Decision Science"
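
The two grep() calls above can also be collapsed into a single pattern with alternation. The following is a minimal sketch, assuming a small made-up data frame majors_demo in place of the scraped majors_table (the name and its rows are illustrative only), and using base R's grepl() so that it runs without any packages:

```r
# Hypothetical stand-in for majors_table; NOT the real FiveThirtyEight data
majors_demo <- data.frame(
  MAJOR = c("Computer Programming & Data Processing",
            "Statistics & Decision Science",
            "Geology & Earth Science"),
  stringsAsFactors = FALSE
)

# "DATA|STATISTICS" matches a string containing either word
is_hit <- grepl("DATA|STATISTICS", majors_demo$MAJOR, ignore.case = TRUE)
hits <- majors_demo[is_hit, , drop = FALSE]
print(hits$MAJOR)
```

Because each row is tested once, a major containing both words would appear only once in the result, whereas union_all() keeps duplicates.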

Question 2: writeLines for a character vector of strings

fruits <- c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime" , "lychee", "mulberry", "olive", "salal berry")

writeLines(fruits)
## bell pepper
## bilberry
## blackberry
## blood orange
## blueberry
## cantaloupe
## chili pepper
## cloudberry
## elderberry
## lime
## lychee
## mulberry
## olive
## salal berry

Question 3: Regular Expressions (ft. Lady Gaga)

What are the perfect strings with which to test regex grouping and backreferences? Why, the delightfully repetitive lyrics of Lady Gaga’s “Bad Romance”, of course!

(.)\1\1 - this will match any single character followed by two more copies of itself, i.e. the same character three times in a row:

test_string1 <- "Oh-oh-oh-oooh, oh-oh-oh / Caught in a bad romance"
str_match_all(test_string1, "(.)\\1\\1")
## [[1]]
##      [,1]  [,2]
## [1,] "ooo" "o"
## source: Lady Gaga, "Bad Romance" LyricFind
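
Note that the regex (.)\1\1 is typed as "(.)\\1\\1" in R, because the backslash must itself be escaped inside an R string. A small sketch of that distinction, using base R with perl = TRUE and a made-up vector words (an assumption for illustration, not part of the exercise):

```r
pattern <- "(.)\\1\\1"

# writeLines() shows the pattern as the regex engine receives it: (.)\1\1
writeLines(pattern)

# The string holds 7 characters; each "\\" in the source is a single backslash
n_chars <- nchar(pattern)

# Hypothetical test words: only those with a tripled character should match
words <- c("oooh", "ooh", "aaa-b", "abab")
has_triple <- grepl(pattern, words, perl = TRUE)
print(words[has_triple])
```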

(.)(.)\2\1 - this will match any sequence of two characters that then repeats in reverse order:

test_string2 <- "I don't wanna be friends, want your bad romance"
str_match_all(test_string2, "(.)(.)\\2\\1")
## [[1]]
##      [,1]   [,2] [,3]
## [1,] "anna" "a"  "n"
## source: Lady Gaga, "Bad Romance" LyricFind
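
In other words, (.)(.)\2\1 finds four-character palindromes. As a rough base R counterpart to str_match_all(), the sketch below pulls every such match out of a made-up string x (an assumption for illustration) with gregexpr() and regmatches():

```r
# Hypothetical input containing two four-character palindromes
x <- "anna and abba sing"

# gregexpr() locates all matches; regmatches() extracts the matched text
palindromes <- regmatches(x, gregexpr("(.)(.)\\2\\1", x, perl = TRUE))[[1]]
print(palindromes)
```

Unlike str_match_all(), this returns only the full matches, not the captured groups.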

(..)\1 - this will match any sequence of two characters that repeats immediately. How about “Gaga,” for example, in test_string3…

test_string3 <- "Rah, rah-ah-ah-ah/ Roma, roma-ma/ Gaga, ooh-la-la/ Want your bad romance"
str_match_all(test_string3, "(..)\\1")
## [[1]]
##      [,1] [,2]
## source: Lady Gaga, "Bad Romance" LyricFind

…no dice! The match is case-sensitive, so “Ga” and “ga” do not count as a repeat. We need the inline flag (?i) to make the match case-insensitive.

str_match_all(test_string3, "(?i)(..)\\1")
## [[1]]
##      [,1]   [,2]
## [1,] "Gaga" "Ga"
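
The effect of (?i) on a backreference can be checked in isolation. A minimal sketch in base R with perl = TRUE (PCRE also understands the inline flag), using the single word "Gaga" as the test input:

```r
x <- "Gaga"

# Without the flag, "Ga" followed by "ga" is not a repeat of the same text
case_sensitive <- grepl("(..)\\1", x, perl = TRUE)

# With (?i), the backreference \1 is compared without regard to case
case_insensitive <- grepl("(?i)(..)\\1", x, perl = TRUE)

print(c(case_sensitive, case_insensitive))
```

In stringr, wrapping the pattern as regex(pattern, ignore_case = TRUE) should be an equivalent way to request case-insensitive matching.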

(.).\1.\1 - this will match any sequence of five characters in which the first, third and fifth are the same. We can almost get this from the two oh-oh-ohs in test_string1, if it weren’t for those pesky hyphens and the final h:

str_match_all(test_string1, "(.).\\1.\\1")
## [[1]]
##      [,1] [,2]

…so, we’ll just add the hyphens in as literals along with a final dot and word boundary. We’ll also add case insensitivity to this regex in order to capture the first, capitalized Oh-oh-oh:

str_match_all(test_string1, "(?i)(.).-\\1.-\\1.\\b")
## [[1]]
##      [,1]       [,2]
## [1,] "Oh-oh-oh" "O" 
## [2,] "oh-oh-oh" "o"
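
A different way to express the same chant, offered only as an alternative sketch and not as the exercise's intended answer, is to capture the syllable once and repeat a hyphen plus backreference with a quantifier. This version uses base R with perl = TRUE:

```r
test_string1 <- "Oh-oh-oh-oooh, oh-oh-oh / Caught in a bad romance"

# (oh) captures the syllable; (-\1){2} requires "-oh" exactly two more times
pattern <- "(?i)(oh)(-\\1){2}"

chants <- regmatches(test_string1,
                     gregexpr(pattern, test_string1, perl = TRUE))[[1]]
print(chants)
```

Here the syllable is hard-coded as "oh", so the pattern is less general than the one above, but it makes the "same thing three times" structure explicit.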

(.)(.)(.).*\3\2\1 - this will match any three characters, followed by zero or more characters of any kind, followed by the same three characters in reverse order.

For this regex, we’ll need to turn to a more recent entry in the Gaga canon. I’m talking, of course, about her 2018 duet with Bradley Cooper from their blockbuster remake of A Star is Born:

test_string4 <- "In the sha-hal, sha-hal-low/ In the shallow, sha-la-la-la-low"

str_match_all(test_string4, "(.)(.)(.).*\\3\\2\\1")
## [[1]]
##      [,1]                                   [,2] [,3] [,4]
## [1,] "al-low/ In the shallow, sha-la-la-la" "a"  "l"  "-"
## Note: LyricFind gives this lyric as "In the shallow, shallow/ In the shallow,  shallow", but anyone who has belted along with Bradley and Gaga knows better than that!
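
The match above is so long because .* is greedy: it consumes as much as it can while still letting the reversed triple match at the end. A small sketch of greedy versus lazy matching, using a made-up string x (an assumption for illustration) and base R with perl = TRUE:

```r
# Hypothetical input: "abc" followed by two later copies of "cba"
x <- "abc-cba-cba"

# Greedy .* stretches to the LAST place where "cba" can still match
greedy <- regmatches(x, regexpr("(.)(.)(.).*\\3\\2\\1", x, perl = TRUE))

# Lazy .*? stops at the FIRST place where "cba" matches
lazy <- regmatches(x, regexpr("(.)(.)(.).*?\\3\\2\\1", x, perl = TRUE))

print(c(greedy = greedy, lazy = lazy))
```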

Question 4: More Regex (ft. MORE GAGA)

Construct regular expressions to match words that…

  1. Start and end with the same character:
test_string5 <- "Out in the club, and I'm sippin'  that bub / And you're not gonna reach my  telephone"

str_match_all(test_string5, "(?i)\\b([a-z])\\S*\\1\\b")
## [[1]]
##      [,1]   [,2]
## [1,] "that" "t" 
## [2,] "bub"  "b"
## source: Lady Gaga, "Telephone (ft. Beyonce)" LyricFind
  2. Contain a repeated pair of letters (e.g. “church” contains “ch” twice):
test_string6 <- "I'm your biggest fan/ I'll follow  you until you love me/ Papa-paparazzi"

str_match_all(test_string6, "(?i)\\b\\S*(.)(.)\\S*\\1\\2\\S*\\b")
## [[1]]
##      [,1]             [,2] [,3]
## [1,] "Papa-paparazzi" "p"  "a"
## source: Lady Gaga, "Paparazzi" LyricFind
  3. Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s):
test_string7 <- "Don't be a drag, just be a queen /  Whether you're broke or evergreen"

str_match_all(test_string7, "(?i)\\b\\S*([a-z])\\S*\\1\\S*\\1\\S*\\b")
## [[1]]
##      [,1]        [,2]
## [1,] "evergreen" "e"
## Source: Lady Gaga, "Born this Way", Musixmatch
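
When the words are already separated into a vector, the three conditions above can be tested one word at a time, which removes the need for \b and \S*. The following is a rough sketch in base R with perl = TRUE; the vector words is a made-up sample, not taken from the lyrics as a whole:

```r
# Hypothetical sample of individual words
words <- c("that", "church", "eleven", "evergreen", "queen", "paparazzi")

# 1. Starts and ends with the same character (anchored to the whole word)
same_ends <- grepl("(?i)^(.).*\\1$", words, perl = TRUE)

# 2. Contains a pair of letters that appears again later in the word
repeated_pair <- grepl("(?i)([a-z]{2}).*\\1", words, perl = TRUE)

# 3. Contains one letter in at least three places
three_times <- grepl("(?i)([a-z]).*\\1.*\\1", words, perl = TRUE)

print(words[same_ends])
print(words[repeated_pair])
print(words[three_times])
```

Restricting the groups to [a-z] keeps hyphens and apostrophes from counting as "letters", which the (.)(.) version in item 2 would allow.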