1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”
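library(RCurl)    # provides getURL()
library(stringr)  # provides str_subset(), str_match_all(), str_extract()
# Original URL on fivethirtyeight.com: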
#https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv
majors_csv <- getURL("https://raw.githubusercontent.com/mmippolito/cuny/main/data607/assignment3/majors-list.csv")
majors <- read.csv(text = majors_csv)
majors_subset <- str_subset(majors$Major, "(DATA|STATISTICS)")
majors_subset
## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"
## [3] "STATISTICS AND DECISION SCIENCE"
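A logical mask is sometimes more convenient than the matching strings themselves, for example when whole data frame rows need to be kept. The following is a minimal sketch, assuming a small hypothetical vector `majors_demo` in place of `majors$Major` so that it runs without a network connection; the mixed-case entry is invented purely to illustrate case sensitivity.

```r
library(stringr)

# Hypothetical stand-in for majors$Major so the example runs offline
majors_demo <- c(
  "COMPUTER PROGRAMMING AND DATA PROCESSING",
  "STATISTICS AND DECISION SCIENCE",
  "Applied Statistics",   # mixed case on purpose (not from the dataset)
  "ZOOLOGY"
)

# Case-sensitive logical mask, one element per major
is_match <- str_detect(majors_demo, "DATA|STATISTICS")
exact_case <- majors_demo[is_match]

# Case-insensitive variant using stringr's regex() modifier
any_case <- str_subset(majors_demo, regex("data|statistics", ignore_case = TRUE))

print(exact_case)
print(any_case)
```

The same mask could be used to subset a data frame, e.g. `majors[is_match, ]`, if the mask were built from the real `Major` column.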
2. Write code that transforms the data below:
[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"
Into a format like this:
c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
raw_string <- '
[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"
'
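# Capture the text between each pair of double quotes (lazy match, quotes excluded).
# No trailing whitespace is required, so the last item matches even without a final newline.
matches <- str_match_all(raw_string, '"(.+?)"')
# Column 2 of the match matrix holds the first capture group
vectored_list <- matches[[1]][, 2]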
vectored_list
## [1] "bell pepper" "bilberry" "blackberry" "blood orange" "blueberry"
## [6] "cantaloupe" "chili pepper" "cloudberry" "elderberry" "lime"
## [11] "lychee" "mulberry" "olive" "salal berry"
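The extraction above yields a character vector, while the question shows the target as the literal text `c("...", "...")`. One way to produce that text is sketched below, assuming a shortened hypothetical vector `fruits` in place of `vectored_list`; each element is wrapped in quotes, joined with commas, and enclosed in `c(...)`.

```r
library(stringr)

# Hypothetical short stand-in for vectored_list
fruits <- c("bell pepper", "bilberry", "lime")

# Wrap each element in double quotes and join with ", "
quoted <- str_c('"', fruits, '"', collapse = ", ")

# Surround the joined items with c( ... )
formatted <- str_c("c(", quoted, ")")

writeLines(formatted)
```

Base R's `dput(fruits)` prints a similar representation without any string manipulation.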
3. Describe, in words, what these expressions will match:
The same character three times in a row, e.g. AAA
s <- "abcAAAabc"
str_extract(s, "(.)\\1\\1")
## [1] "AAA"
A four-character palindrome: two characters followed by the same two characters in reverse order, e.g. ABBA
s <- "abcABBAabc"
str_extract(s, "(.)(.)\\2\\1")
## [1] "ABBA"
A pair of characters immediately followed by the same pair, e.g. ABAB
s <- "abcABABabc"
str_extract(s, "(..)\\1")
## [1] "ABAB"
A character that occurs three times, with exactly one arbitrary character between each occurrence, e.g. AxAyA
s <- "abcAxAyAabc"
str_extract(s, "(.).\\1.\\1")
## [1] "AxAyA"
Three characters, followed by any number of characters (including none), followed by the same three characters in reverse order, e.g. ABCxyzCBA
s <- "abcABCxyzCBAabc"
str_extract(s, "(.)(.)(.).*\\3\\2\\1")
## [1] "ABCxyzCBA"
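The descriptions above can be checked against strings that should and should not match. This is a small sketch with test strings chosen for illustration; note that two capture groups may capture the same character, so `(.)(.)\2\1` also matches "AAAA".

```r
library(stringr)

# Same character three times in a row: "ABA" has no such run
triple <- str_detect(c("AAA", "ABA"), "(.)\\1\\1")

# Two characters then their mirror image: both groups may hold the same character
mirrored <- str_detect(c("ABBA", "AAAA", "ABAB"), "(.)(.)\\2\\1")

# Three characters, any gap (possibly empty), then the three reversed
gap <- str_detect(c("ABCCBA", "ABCxyzCBA", "ABCxyzABC"), "(.)(.)(.).*\\3\\2\\1")

print(triple)    # TRUE FALSE
print(mirrored)  # TRUE TRUE FALSE
print(gap)       # TRUE TRUE FALSE
```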
4. Construct regular expressions to match words that:
Start and end with the same character: ^(.).*\1$
s <- "AasdfasdfA"
str_extract(s, "^(.).*\\1$")
## [1] "AasdfasdfA"
Contain a repeated pair of letters: ([A-Za-z]{2}).*\1
s <- "church"
str_extract(s, "([A-Za-z]{2}).*\\1")
## [1] "church"
Contain one letter repeated in at least three places: ([A-Za-z]).*\1.*\1
s <- "eleven"
str_extract(s, "([A-Za-z]).*\\1.*\\1")
## [1] "eleve"
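Because `str_extract()` returns only the matched portion (hence "eleve" rather than "eleven"), `str_subset()` may be a better fit when the goal is to list whole words. A minimal sketch follows, assuming a small hypothetical word list `demo_words` chosen for illustration.

```r
library(stringr)

# Hypothetical word list for illustration
demo_words <- c("church", "eleven", "level", "banana", "data")

# Words that start and end with the same character
same_ends <- str_subset(demo_words, "^(.).*\\1$")

# Words that contain a repeated pair of letters
repeated_pair <- str_subset(demo_words, "([A-Za-z]{2}).*\\1")

# Words that contain one letter in at least three places
three_times <- str_subset(demo_words, "([A-Za-z]).*\\1.*\\1")

print(same_ends)      # "level"
print(repeated_pair)  # "church" "banana"
print(three_times)    # "eleven" "banana"
```

The first pattern needs at least two characters, so a one-letter word would not match it as written.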