1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”

library(RCurl)    # provides getURL()
library(stringr)  # provides str_subset(), str_match_all(), str_extract()
# Original URL on fivethirtyeight.com:
# https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv
majors_csv <- getURL("https://raw.githubusercontent.com/mmippolito/cuny/main/data607/assignment3/majors-list.csv")
majors <- read.csv(text = majors_csv)
majors_subset <- str_subset(majors$Major, "(DATA|STATISTICS)")
majors_subset
## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"     
## [3] "STATISTICS AND DECISION SCIENCE"
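As a variation on the pattern above, the sketch below anchors the keywords to word boundaries and ignores case. The `sample_majors` vector is hypothetical and stands in for `majors$Major` so that the block runs without a network connection; "DATABASE ADMINISTRATION" is included only to show a string that a plain "DATA" search would catch but a whole-word search does not.

```r
library(stringr)

# Hypothetical sample values; the real ones come from majors$Major
sample_majors <- c("STATISTICS AND DECISION SCIENCE",
                   "Computer Programming and Data Processing",
                   "DATABASE ADMINISTRATION",
                   "GENERAL BUSINESS")

# \\b restricts each keyword to a whole word; ignore_case handles mixed case
pattern <- regex("\\b(DATA|STATISTICS)\\b", ignore_case = TRUE)
hits <- str_subset(sample_majors, pattern)
hits
```

Whether whole-word matching is wanted depends on how the question is read; the unanchored pattern is the more literal reading of "contain".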

2. Write code that transforms the data below:

[1] “bell pepper” “bilberry” “blackberry” “blood orange”

[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”

[9] “elderberry” “lime” “lychee” “mulberry”

[13] “olive” “salal berry”

Into a format like this:

c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)

raw_string <- '
[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
  
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  
  
[9] "elderberry"   "lime"         "lychee"       "mulberry"    
  
[13] "olive"        "salal berry"
'
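# Capture the text between each pair of double quotes (no trailing whitespace required)
matches <- str_match_all(raw_string, '"([^"]+)"')
# Column 1 holds the full match, column 2 the first capture group
vectored_list <- matches[[1]][, 2]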
vectored_list
##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"
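If the target is read as the literal text `c("bell pepper", ...)` rather than a character vector, one possible approach is to build that string with `paste0()`. This is a minimal sketch; the `fruits` vector simply retypes the 14 values of `vectored_list` so the block runs on its own, and `as_code` is a name introduced here for illustration.

```r
# Same 14 values as vectored_list, typed in so the block is self-contained
fruits <- c("bell pepper", "bilberry", "blackberry", "blood orange",
            "blueberry", "cantaloupe", "chili pepper", "cloudberry",
            "elderberry", "lime", "lychee", "mulberry",
            "olive", "salal berry")

# Wrap each element in double quotes, join with ", ", and surround with c( )
as_code <- paste0("c(", paste0('"', fruits, '"', collapse = ", "), ")")
writeLines(as_code)
```

Evaluating the resulting string with `eval(parse(text = as_code))` gives back the original vector, which is a quick way to check the round trip.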

3. Describe, in words, what these expressions will match:

The same character three times in a row, e.g. AAA

s <- "abcAAAabc"
str_extract(s, "(.)\\1\\1")
## [1] "AAA"

A four-character palindrome: two characters followed by the same two in reverse order, e.g. ABBA

s <- "abcABBAabc"
str_extract(s, "(.)(.)\\2\\1")
## [1] "ABBA"

A pair of characters that is immediately repeated, e.g. ABAB

s <- "abcABABabc"
str_extract(s, "(..)\\1")
## [1] "ABAB"

The same character three times, with exactly one character (any character) between each occurrence, e.g. AxAyA

s <- "abcAxAyAabc"
str_extract(s, "(.).\\1.\\1")
## [1] "AxAyA"

Three characters, then any number of characters (possibly none), then the same three characters in reverse order, e.g. ABCxyzCBA

s <- "abcABCxyzCBAabc"
str_extract(s, "(.)(.)(.).*\\3\\2\\1")
## [1] "ABCxyzCBA"
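The descriptions above can be probed with a few edge cases. The sketch below pairs each of the five patterns with one string that should match and one that should not; the test strings and the variable names are made up for illustration. Note that the capture groups are not required to hold different characters, so "AAAA" satisfies the palindrome and repeated-pair patterns as well.

```r
library(stringr)

# The five patterns from the answers above, named for readability
patterns <- c(
  triple      = "(.)\\1\\1",
  palindrome4 = "(.)(.)\\2\\1",
  pair_twice  = "(..)\\1",
  alternating = "(.).\\1.\\1",
  mirrored3   = "(.)(.)(.).*\\3\\2\\1"
)

# One edge-case string per pattern, compared element-wise
edge_cases <- c("AAAA", "AAAA", "AAAA", "AAAAA", "ABCCBA")
edge_hits <- str_detect(edge_cases, unname(patterns))

# One string per pattern that should fail to match
non_matches <- c("AABA", "ABAB", "ABAC", "AxByA", "ABCxyzABC")
non_hits <- str_detect(non_matches, unname(patterns))

edge_hits  # all TRUE
non_hits   # all FALSE
```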

4. Construct regular expressions to match words that:

Start and end with the same character: ^(.).*\1$

s <- "AasdfasdfA"
str_extract(s, "^(.).*\\1$")
## [1] "AasdfasdfA"
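One caveat: the pattern needs at least two characters, so a one-letter word such as "a" does not match even though it arguably starts and ends with the same character. If such words should count, a possible variant makes the tail optional. This is a sketch; the `words` vector and the variant pattern are assumptions introduced here, not part of the original answer.

```r
library(stringr)

# Hypothetical test words
words <- c("a", "abba", "abc", "xx")

# Original pattern: requires a second occurrence of the first character
strict <- str_detect(words, "^(.).*\\1$")

# Variant: the "anything, then the first character again" tail is optional
lenient <- str_detect(words, "^(.)(.*\\1)?$")

strict   # FALSE  TRUE FALSE  TRUE
lenient  #  TRUE  TRUE FALSE  TRUE
```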

Contain a repeated pair of letters: ([A-Za-z]{2}).*\1

s <- "church"
str_extract(s, "([A-Za-z]{2}).*\\1")
## [1] "church"

Contain one letter repeated in at least three places: ([A-Za-z]).*\1.*\1

s <- "eleven"
str_extract(s, "([A-Za-z]).*\\1.*\\1")
## [1] "eleve"
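`str_extract()` returns only the matched span, which is why the result above is "eleve" rather than the whole word. To select the words themselves, `str_subset()` can be applied to a vector instead. The sketch below uses a small hypothetical word list to exercise the repeated-pair and three-occurrence patterns.

```r
library(stringr)

# Hypothetical word list
words <- c("eleven", "banana", "church", "apple")

# Words containing a repeated pair of letters
repeated_pair <- str_subset(words, "([A-Za-z]{2}).*\\1")

# Words containing one letter in at least three places
three_times <- str_subset(words, "([A-Za-z]).*\\1.*\\1")

repeated_pair  # "banana" "church"
three_times    # "eleven" "banana"
```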