1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”.

Solution

library(tidyverse)

df_p1 <- read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv')

# Check which columns contain the pattern. When a data frame is passed, str_detect() coerces
# each column to a single string, so it returns one TRUE/FALSE per column.
str_detect(df_p1, '(DATA|STATISTICS)')
## [1] FALSE  TRUE FALSE
str_subset(df_p1[[2]], pattern = '(DATA|STATISTICS)')
## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"     
## [3] "STATISTICS AND DECISION SCIENCE"
# In a single line of code:
str_subset(df_p1[[which(str_detect(df_p1, '(DATA|STATISTICS)'))]], pattern = '(DATA|STATISTICS)')
## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"     
## [3] "STATISTICS AND DECISION SCIENCE"
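
The column check above relies on how str_detect() coerces a data frame. A minimal sketch of a more explicit alternative tests each column as an ordinary character vector. The `majors` data frame below is a small hypothetical stand-in for the downloaded file (the rows and column names are assumptions for illustration), and `has_match` and `matches` are names introduced here.

```r
library(stringr)

# Hypothetical stand-in for the majors list; rows and column names are assumed
majors <- data.frame(
  Major = c("STATISTICS AND DECISION SCIENCE",
            "GENERAL AGRICULTURE",
            "COMPUTER PROGRAMMING AND DATA PROCESSING"),
  Major_Category = c("Computers & Mathematics",
                     "Agriculture & Natural Resources",
                     "Computers & Mathematics"),
  stringsAsFactors = FALSE
)

pattern <- "DATA|STATISTICS"

# For each column, TRUE if at least one value matches the pattern
has_match <- vapply(majors, function(col) any(str_detect(col, pattern)), logical(1))

# Extract the matching values from the first flagged column
matches <- str_subset(majors[[which(has_match)[1]]], pattern)
matches
```

Because each column is tested value by value, this version does not depend on how a whole data frame is converted to strings.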

2. Write code that transforms the data below:

## [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
## 
## [5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  
## 
## [9] "elderberry"   "lime"         "lychee"       "mulberry"    
## 
## [13] "olive"        "salal berry"

Into a format like this:

## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

Solution

input_v <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"

[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  

[9] "elderberry"   "lime"         "lychee"       "mulberry"    

[13] "olive"        "salal berry"'

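# Lazy matching (.*? instead of .*) stops at the nearest closing quote, so each quoted item is
# extracted separately. Because input_v is a single string, str_extract_all() returns a list
# holding one character vector, which is why list_items[[1]] is used below.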

list_items <- str_extract_all(string = input_v, pattern = '\\".*?\\"')
items <- str_c(list_items[[1]], collapse = ', ')
str_glue('c({items})')
## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
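
To illustrate why lazy matching matters here, the sketch below compares a greedy and a lazy quantifier on a short one-line string. The string `x` and the names `greedy` and `lazy` are made up for this example.

```r
library(stringr)

# Two quoted items on a single line (hypothetical example input)
x <- '"lime" "olive"'

# Greedy: .* runs to the last quote on the line, giving one long match
greedy <- str_extract_all(x, '".*"')[[1]]

# Lazy: .*? stops at the nearest closing quote, giving one match per item
lazy <- str_extract_all(x, '".*?"')[[1]]

length(greedy)
length(lazy)
```

The greedy version returns a single element spanning both items, while the lazy version returns the two items separately.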

3. Describe, in words, what these expressions will match:

4. Construct regular expressions to match words that:

I assume that “words” means actual words rather than arbitrary sequences of characters. I also could not find a way to make backreferences case-insensitive, so the patterns below use lowercase letters only.

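# Words that start and end with the same letter (the second alternative also accepts
# one- and two-letter words such as "a" or "aa")
str_subset(string = c('lol', ' madam', 'cat'), pattern = '(^|\\s)([a-z])(([a-z]+\\2(\\s|$))|\\2?(\\s|$))')
## [1] "lol"    " madam"
# Words that contain a repeated pair of letters
str_subset(string = c('tomato', ' mississippi ', 'what'), pattern = '(^|\\s)[a-z]*([a-z][a-z])[a-z]*\\2[a-z]*(\\s|$)')
## [1] "tomato"        " mississippi "
# Words that contain the same letter in at least three places
str_subset(string = c('applepie', ' monsoon ', 'panda'), pattern = '(^|\\s)[a-z]*([a-z])[a-z]*\\2[a-z]*\\2[a-z]*(\\s|$)')
## [1] "applepie"  " monsoon "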
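
On the case-sensitivity limitation mentioned above, one option that may help is wrapping the pattern in stringr's regex() with ignore_case = TRUE. To my understanding, the underlying ICU engine then also compares backreferences case-insensitively, but that is an assumption worth verifying on your own installation. The word list and the simplified pattern below are illustrative only and are not the patterns used above.

```r
library(stringr)

# Hypothetical test words, including one with a capitalised first letter
words <- c("Madam", "lol", "cat")

# Same first and last letter; ignore_case is assumed to apply to \\1 as well
same_ends <- regex("^([a-z])(.*\\1)?$", ignore_case = TRUE)

result <- str_subset(words, same_ends)
result
```

If the assumption holds, "Madam" is kept along with "lol" even though its first and last letters differ in case.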