Question 1

Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”

dataURL <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"
majors <- read.csv(dataURL)
grep(pattern = 'DATA',majors$Major, value = TRUE, ignore.case = TRUE)
## [1] "COMPUTER PROGRAMMING AND DATA PROCESSING"
grep(pattern = 'STATISTICS',majors$Major, value = TRUE, ignore.case = TRUE)
## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "STATISTICS AND DECISION SCIENCE"
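Both searches can also be folded into a single call with the regex alternation operator `|`. The sketch below is a minimal illustration that runs without a download: `sample_majors` is a hypothetical stand-in for `majors$Major`, holding only a few rows.

```r
# Hypothetical stand-in for majors$Major (a few rows only, for illustration)
sample_majors <- c(
  "GENERAL AGRICULTURE",
  "COMPUTER PROGRAMMING AND DATA PROCESSING",
  "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
  "STATISTICS AND DECISION SCIENCE"
)

# "|" means "either side matches", so one pattern covers both keywords
matched <- grep("DATA|STATISTICS", sample_majors, value = TRUE, ignore.case = TRUE)
print(matched)
```

On the full dataset the same pattern would be passed with `majors$Major` in place of `sample_majors`.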

Question 2

Write code that transforms the data below:

[1] “bell pepper” “bilberry” “blackberry” “blood orange”
[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”
[9] “elderberry” “lime” “lychee” “mulberry”
[13] “olive” “salal berry”

Into a format like this:

c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)

# Store the fruits as a plain character vector
fruits <- c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

# Quote each element, join the pieces with ", ", and wrap the result in c( )
cat(paste0("c(", paste0('"', fruits, '"', collapse = ", "), ")"))
## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
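If the starting point is the printed console output itself rather than an R object, one possible approach is to pull the quoted pieces out with a regular expression. This is only a sketch: it assumes the printout has been pasted into a single string with straight double quotes, and the names `raw`, `quoted` and `fruit_vec` are illustrative.

```r
# The printed output pasted in as one raw string (straight quotes assumed)
raw <- '[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"'

# Extract every run of non-quote characters enclosed in double quotes
quoted <- regmatches(raw, gregexpr('"[^"]+"', raw))[[1]]

# Drop the surrounding quote characters to get the bare fruit names
fruit_vec <- gsub('"', "", quoted, fixed = TRUE)

# dput() writes the vector back out as R code, i.e. in c(...) form
dput(fruit_vec)
```

The index markers such as `[1]` and `[5]` are skipped automatically because they are not enclosed in quotes.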

The two exercises below are taken from R for Data Science, 14.3.5.1 in the on-line version:

#3 Describe, in words, what these expressions will match:

(.)\1\1

"(.)(.)\\2\\1"

(..)\1

"(.).\\1.\\1"

"(.)(.)(.).*\\3\\2\\1"

library(stringr)

# Test words; patterns are R strings, so each backreference is written \\1
test1 <- c("aaa", "anna", "banana", "amana", "redder")
str_view(test1, "(.)\\1\\1", match = TRUE)
str_view(test1, "(.)(.)\\2\\1", match = TRUE)
str_view(test1, "(..)\\1", match = TRUE)
str_view(test1, "(.).\\1.\\1", match = TRUE)
str_view(test1, "(.)(.)(.).*\\3\\2\\1", match = TRUE)

In order, expression 1, (.)\1\1, captures any single character with the group (.) and then requires two consecutive backreferences \1\1 to that group. A backreference does not search through the string; it matches only if the exact text captured by the group appears again at that position. The expression therefore matches the same character three times in a row, as in “aaa”. Note that it is shown here as a bare regular expression: inside an R string it must be written "(.)\\1\\1", because R reads "\1" as an octal escape for the control character with code 1 rather than as a backreference.
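One detail worth checking when testing these patterns in R is how the string literal itself is parsed. The following sketch uses only base R and compares a singly and a doubly escaped backreference; the variable names are illustrative, and `perl = TRUE` is chosen here simply to use the PCRE engine.

```r
# In an R string literal, "\1" is an octal escape: one control character (code 1)
one_escape <- "\1"
# "\\1" is two characters, a backslash and the digit 1: a regex backreference
two_chars <- "\\1"
print(nchar(one_escape))
print(nchar(two_chars))

# Only the doubly escaped pattern can find a tripled character
tripled_ok  <- grepl("(.)\\1\\1", "aaa", perl = TRUE)
tripled_bad <- grepl("(.)\1\1", "aaa", perl = TRUE)
print(c(tripled_ok, tripled_bad))
```

The singly escaped version asks for any character followed by two literal control characters, which ordinary words do not contain.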

Expression 2, "(.)(.)\\2\\1", has a first capturing group (.) followed by a second capturing group (.), then a backreference to the second group \\2, and finally a backreference to the first group \\1. It matches any two characters followed by the same two characters in reverse order, as in the word anna, where the second half mirrors the first.

Expression 3, (..)\1, captures any two characters (not only digits) with the group (..) and follows them with a single backreference \1 to that two-character string. It matches a pair of characters that is immediately repeated, such as the “anan” in banana. Inside an R string the backreference must be escaped, giving "(..)\\1".

Expression 4, "(.).\\1.\\1", has a first capturing group (.), then any single character ., followed by a backreference to the first group \\1, followed by another single character ., followed by a final backreference to the first group \\1. It matches a character that appears three times with exactly one arbitrary character between occurrences, as in amana.

Expression 5, "(.)(.)(.).*\\3\\2\\1", has three consecutive capturing groups (.)(.)(.), then .* (zero or more of any character), then backreferences to the third, second and first groups in that order, \\3\\2\\1. It matches three characters followed, after any amount of intervening text, by the same three characters in reverse order. This will match a word like ‘redder’, where the first three letters are reversed to complete the second half of the word.
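The five descriptions can be checked side by side with base R. This is a minimal sketch rather than part of the original exercise: the word list and the labels in `patterns` are made up for illustration, and `perl = TRUE` selects the PCRE engine.

```r
# Illustrative test words (an assumption, chosen so each pattern has a hit)
words <- c("aaa", "anna", "banana", "amana", "redder")

# The five expressions as R strings, with each backreference written \\N
patterns <- c(
  triple     = "(.)\\1\\1",
  mirror2    = "(.)(.)\\2\\1",
  pair_twice = "(..)\\1",
  alternate  = "(.).\\1.\\1",
  mirror3    = "(.)(.)(.).*\\3\\2\\1"
)

# For each pattern, keep the words that contain at least one match
hits <- lapply(patterns, function(p) grep(p, words, value = TRUE, perl = TRUE))
print(hits)
```

A word can appear under more than one label, since these patterns are not anchored and only need to match somewhere inside the word.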

#4 Construct regular expressions to match words that:

Start and end with the same character.

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

test2 <- c("kayak", "church", "discussion")

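# Start and end with the same character
str_view(test2, "^(.).*\\1$", match = TRUE)
# Contain a repeated pair of letters
str_view(test2, "([A-Za-z][A-Za-z]).*\\1", match = TRUE)
# Contain one letter repeated in at least three places
str_view(test2, "([A-Za-z]).*\\1.*\\1", match = TRUE)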
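The three patterns can also be exercised with base R on a slightly larger word list. The sketch below is a hedged variation, not the only valid answer: the extra words, the variable names, the letter classes `[A-Za-z]`, and the optional group that lets a one-letter word count as starting and ending with the same character are all assumptions added here.

```r
# Illustrative words, including a one-letter word and the "eleven" example
words <- c("kayak", "church", "discussion", "a", "eleven")

# Same first and last character; the optional group also admits one-letter words
same_ends <- grep("^(.)(.*\\1)?$", words, value = TRUE, perl = TRUE)

# A pair of letters that occurs again later in the word
repeated_pair <- grep("([A-Za-z][A-Za-z]).*\\1", words, value = TRUE, perl = TRUE)

# One letter that occurs in at least three places
three_times <- grep("([A-Za-z]).*\\1.*\\1", words, value = TRUE, perl = TRUE)

print(list(same_ends = same_ends,
           repeated_pair = repeated_pair,
           three_times = three_times))
```

Restricting the groups to letters keeps spaces or punctuation from counting as a repeated "letter", which matters only if the input is not already a list of clean words.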