1. Provide code that identifies the majors that contain either “DATA” or “STATISTICS” in fivethirtyeight.com’s College Majors dataset.

majors <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv", header = TRUE, sep = ",")

#From this dataframe, select the column listing majors and store it as a matrix
major_col <- as.matrix(majors[,2, drop=FALSE])

#Find the indices of majors that contain either "DATA" or "STATISTICS"
sapply("DATA", function(y) grep(y,major_col)) 
## DATA 
##   52
sapply("STATISTICS", function(y) grep(y,major_col))
##      STATISTICS
## [1,]         44
## [2,]         59
#Store and then display the majors of interest
matching_majors <- c(major_col[52], major_col[44], major_col[59])
matching_majors
## [1] "COMPUTER PROGRAMMING AND DATA PROCESSING"     
## [2] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [3] "STATISTICS AND DECISION SCIENCE"
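The two separate searches above can also be combined into one pattern with regex alternation. The following is only a sketch: `sample_majors` and `matching_sample` are hypothetical names, and the short vector is an assumed stand-in for the majors column so that the example runs without downloading the file.

```r
# Hypothetical sample standing in for the majors column (assumption for illustration)
sample_majors <- c(
  "GENERAL AGRICULTURE",
  "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
  "COMPUTER PROGRAMMING AND DATA PROCESSING",
  "ACTUARIAL SCIENCE",
  "STATISTICS AND DECISION SCIENCE"
)

# "|" is regex alternation: a string matches if it contains DATA or STATISTICS.
# value = TRUE returns the matching strings themselves instead of their indices.
matching_sample <- grep("DATA|STATISTICS", sample_majors, value = TRUE)
matching_sample
```

Applied to the real data, the same call would presumably be `grep("DATA|STATISTICS", majors[, 2], value = TRUE)`, which avoids hard-coding row indices; results come back in row order rather than grouped by keyword.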

2. Write code that transforms the data below:

I interpreted this question as: given the printed form of the data stored as one long string, convert it into a character vector with one element per item.

produce <- c('[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
 [5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  
 [9] "elderberry"   "lime"         "lychee"       "mulberry"    
[13] "olive"        "salal berry"')
produce #verify what the input looks like
## [1] "[1] \"bell pepper\"  \"bilberry\"     \"blackberry\"   \"blood orange\"\n [5] \"blueberry\"    \"cantaloupe\"   \"chili pepper\" \"cloudberry\"  \n [9] \"elderberry\"   \"lime\"         \"lychee\"       \"mulberry\"    \n[13] \"olive\"        \"salal berry\""

Based on this input, we see that the string contains index markers, escaped quotes, and newlines, so it needs heavy cleaning before it can be split into individual elements.

library(stringr) #provides str_replace_all() and str_split()

#Handle special characters and digits
produce <- str_replace_all(produce, "[\\[\\]]", "") #remove square brackets
produce <- str_replace_all(produce, "[[:digit:]]", "") #remove digits
produce <- str_replace_all(produce, "\\n", "") #remove newlines
produce <- str_replace_all(produce, '"', "'") #replace double quotes with single quotes

#Handle white space
produce <- trimws(produce) #remove leading / trailing whitespace
produce <- str_replace_all(produce, "\\s+", " ") #compress whitespace
produce <- str_replace_all(produce, "' '", "','") #turn the quote-space-quote between items into quote-comma-quote

#Remove excess characters and properly split and then remerge the vector
produce <- str_replace_all(produce, "'", "") #remove remaining 's
produce <- str_split(produce, pattern=",") #convert vector to list at each ,
produce <- unlist(produce) #convert list back to vector

#Verify that our output matches what was desired
desired_output <- c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
produce 
##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"
produce == desired_output
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
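As an alternative to the step-by-step cleaning above, the quoted items can be pulled out directly, since each item is exactly the text between a pair of double quotes. This is a sketch using base R rather than stringr; `printed`, `quoted`, and `items` are hypothetical names, and the input is a shortened, assumed version of the exercise string.

```r
# Shortened stand-in for the exercise input (assumption for illustration)
printed <- '[1] "bell pepper"  "bilberry"\n[3] "lime"         "salal berry"'

# Find every run of non-quote characters enclosed in double quotes
quoted <- regmatches(printed, gregexpr('"[^"]*"', printed))[[1]]

# Drop the surrounding quote characters, leaving one element per item
items <- gsub('"', "", quoted, fixed = TRUE)
items
```

Because the pattern only looks at what sits between quotes, the index markers, newlines, and padding never need to be removed explicitly.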

The two exercises below are taken from R for Data Science, 14.3.5.1 in the on-line version:

3. Describe, in words, what these expressions will match:

(.)\1\1

Read as a regular expression, this matches any single character repeated three times in a row (e.g. "aaa"). It cannot be passed to R exactly as written: unquoted it is a syntax error (the chunk will not knit), and inside an R string with single backslashes ("(.)\1\1") R reads "\1" as an octal escape for a control character rather than a backreference, so it would not match ordinary text. The R string for this regular expression needs doubled backslashes: "(.)\\1\\1".

#Quoted, with doubled backslashes, so that the chunk knits
str_view(fruit, "(.)\\1\\1", match = TRUE)
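The difference between a regular expression and the R string that encodes it can be checked directly. The sketch below uses base R only; `regex_as_string` and `octal_string` are hypothetical names, and `perl = TRUE` is just one reasonable choice of regex engine.

```r
# Doubled backslashes: the string holds the 7 characters ( . ) \ 1 \ 1
regex_as_string <- "(.)\\1\\1"
writeLines(regex_as_string)    # shows the regex the engine actually receives
n_regex <- nchar(regex_as_string)

# Single backslashes: R parses "\1" as an octal escape (character code 1),
# so the string holds only 5 characters and contains no backreference
octal_string <- "(.)\1\1"
n_octal <- nchar(octal_string)
control_code <- utf8ToInt("\1")

# Only the doubled-backslash version behaves as "same character three times"
hits_regex <- grepl(regex_as_string, c("aaa", "aab", "Mississippi"), perl = TRUE)
hits_octal <- grepl(octal_string, "aaa", perl = TRUE)
```

Here `hits_regex` is TRUE only for "aaa", while `hits_octal` is FALSE because the pattern is looking for two literal control characters.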

"(.)(.)\\2\\1"

This expression matches any two characters followed by the same two characters in reverse order: 1st character - 2nd character - 2nd character - 1st character (e.g. "abba"). From the fruits example it would return bell pepper and chili pepper. Note the order e-p-p-e.

str_view(fruit, "(.)(.)\\2\\1", match = TRUE)

(..)\1

Read as a regular expression, this matches any two characters immediately followed by the same two characters (e.g. "anan" in banana). As with the first expression, it cannot be passed to R unquoted, and as an R string it needs a doubled backslash: "(..)\\1".

#Quoted, with a doubled backslash, so that the chunk knits
str_view(fruit, "(..)\\1", match = TRUE)

"(.).\\1.\\1"

This expression matches a character, then any character, then the first character again, then any character, then the first character a third time (5 characters total). The two "any" positions do not have to hold the same character. From our fruits example it would return banana (a-n-a-n-a) and papaya (a-p-a-y-a). Note the order a.a.a

str_view(fruit, "(.).\\1.\\1", match = TRUE)

"(.)(.)(.).*\\3\\2\\1"

This expression matches any three characters, followed by zero or more characters, followed by the same three characters in reverse order: 1st character - 2nd character - 3rd character - any sequence (…) - 3rd character - 2nd character - 1st character. The match can occur anywhere in the string, not only at its start and end. The fruits example doesn’t provide a fit, but a positive match could be: ‘hititontheyabadabatih’.

madeupword <- "hititontheyabadabatih"
str_view(madeupword, "(.)(.)(.).*\\3\\2\\1", match = TRUE)
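To see exactly which part of a string each backreference pattern captures, the first match can be extracted rather than only highlighted. This is a sketch in base R; `first_match` is a hypothetical helper, not part of stringr, and the test words are taken from the examples discussed above.

```r
# Hypothetical helper: return the first substring of `text` matching `pattern`
# (character(0) if there is no match)
first_match <- function(pattern, text) {
  regmatches(text, regexpr(pattern, text, perl = TRUE))
}

m_mirror <- first_match("(.)(.)\\2\\1", "chili pepper")   # "eppe"
m_pair   <- first_match("(..)\\1", "banana")              # "anan"
m_aba    <- first_match("(.).\\1.\\1", "papaya")          # "apaya"
m_long   <- first_match("(.)(.)(.).*\\3\\2\\1", "hititontheyabadabatih")
```

For the last pattern the greedy `.*` stretches as far as possible, so `m_long` is the whole made-up word, from "hit" at the start to "tih" at the end.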

4. Construct regular expressions to match words that:

Start and end with the same character: "^(.).*\\1$" (this requires at least two characters, so a one-letter word such as "a" is not matched)

string1 <- "dead"
string2 <- "defeated"

str_view(string1, "^(.).*\\1$")
str_view(string2, "^(.).*\\1$")

Contain a repeated pair of letters (e.g. "church" contains "ch" repeated twice.): "(..).*\\1"

string3 <- "church"
string4 <- "banana"

str_view(string3, "(..).*\\1")
str_view(string4, "(..).*\\1")

Contain one letter repeated in at least three places (e.g. "eleven" contains three "e"s.): "(.).*\\1.*\\1"

string5 <- "eleven"
string6 <- "bee-eater"

str_view(string5, "(.).*\\1.*\\1", match = TRUE)
str_view(string6, "(.).*\\1.*\\1", match = TRUE)
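A quick way to check the three patterns in this exercise is to test each one against words that should match and one that should not. The sketch below uses base R `grepl()`; the variable names are hypothetical, and "church" and "dead" are reused as assumed negative examples.

```r
same_ends     <- "^(.).*\\1$"      # starts and ends with the same character
repeated_pair <- "(..).*\\1"       # some pair of characters occurs twice
three_times   <- "(.).*\\1.*\\1"   # some character occurs at least three times

ends_ok  <- grepl(same_ends,     c("dead", "defeated", "church"),   perl = TRUE)
pair_ok  <- grepl(repeated_pair, c("church", "banana", "dead"),     perl = TRUE)
three_ok <- grepl(three_times,   c("eleven", "bee-eater", "church"), perl = TRUE)

# Edge case: a one-character word is not matched by same_ends
single_ok <- grepl(same_ends, "a", perl = TRUE)
```

In each vector the first two words match and the third does not, and `single_ok` is FALSE. Note that `.` matches any character, not only letters, so these patterns are slightly broader than the wording of the exercise.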