1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”.

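url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"
# read.csv() can read directly from a URL, so no extra package is needed
majors <- read.csv(url, header = TRUE)
# keep every major whose name contains "STATISTICS" or "DATA" (case-insensitive)
grep(pattern = 'STATISTICS|DATA', majors$Major, value = TRUE, ignore.case = TRUE)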
## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"     
## [3] "STATISTICS AND DECISION SCIENCE"
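
As a cross-check on the pattern, the same match can be sketched with grepl(), which returns a logical vector that can be both counted and used for subsetting. The vector sample_majors below is a small hand-typed stand-in (an assumption made so the snippet runs without downloading the CSV); with the real data it would be majors$Major.

```r
# Illustrative stand-in for majors$Major (assumption: hand-typed sample, not the full list)
sample_majors <- c("MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
                   "COMPUTER PROGRAMMING AND DATA PROCESSING",
                   "GENERAL AGRICULTURE",
                   "STATISTICS AND DECISION SCIENCE")

# TRUE wherever the name contains DATA or STATISTICS, ignoring case
hits <- grepl("DATA|STATISTICS", sample_majors, ignore.case = TRUE)

# Subset with the logical vector, then count the matches
matched <- sample_majors[hits]
print(matched)
print(sum(hits))
```

Unlike grep(..., value = TRUE), the logical vector keeps its alignment with the original rows, so it could also be used to filter the whole data frame.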

2. Write code that transforms the data below:

[1] “bell pepper” “bilberry” “blackberry” “blood orange”

[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”

[9] “elderberry” “lime” “lychee” “mulberry”

[13] “olive” “salal berry”

Into a format like this:

c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)

library(stringr)

startstr <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"

[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  

[9] "elderberry"   "lime"         "lychee"       "mulberry"    

[13] "olive"        "salal berry"'

dbl_quote <- '"'
# Use gregexpr() to find the position of every double quote in startstr
double_quotes_positions <- gregexpr(pattern = dbl_quote, text = startstr)
# check double_quotes_positions[[1]][1]
double_quotes_positions[[1]][1]
## [1] 5
# print head of double_quotes positions
head(double_quotes_positions[[1]])
## [1]  5 17 20 29 35 46
# store all the double-quote positions in the plain integer vector dq_pos
dq_pos <- as.integer(double_quotes_positions[[1]])

# every word is enclosed by two quotes, so there are half as many words as quotes
no_of_words <- length(dq_pos)/2
desired_output <- character(no_of_words)
# check the length of desired_output
length(desired_output)
## [1] 14
# word i lies between its opening quote (index 2*i-1) and its closing quote (index 2*i)
for (i in seq_len(no_of_words)) {
  desired_output[i] <- substring(startstr,
                                 double_quotes_positions[[1]][2*i-1] + 1,
                                 double_quotes_positions[[1]][2*i] - 1)
}

# join the words with the separator ", " (quotes included) and wrap the result in c("...")
end_result <- paste0("c(\"", paste0(desired_output, collapse = "\", \""), "\")")

# writeLines() prints the string without escaping the embedded double quotes
writeLines(end_result)
## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
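
The position bookkeeping above can also be avoided by pulling out each quoted substring directly. The following is a minimal base R sketch of that alternative, not the approach used above: regmatches() is paired with gregexpr() to return the matched text instead of the match positions. The string raw is a shortened, hypothetical stand-in for startstr so the snippet is self-contained.

```r
# Shortened stand-in for startstr (assumption: same layout as the printed vector)
raw <- '[1] "bell pepper"  "bilberry"\n[3] "olive"        "salal berry"'

# Match each run of non-quote characters enclosed in double quotes;
# regmatches() returns the matched text, quotes included
quoted <- regmatches(raw, gregexpr('"[^"]+"', raw))[[1]]

# The matches already carry their quotes, so only the commas and c( ) are added
result <- paste0("c(", paste(quoted, collapse = ", "), ")")
writeLines(result)
```

Because the pattern only matches text between quotes, the index markers such as [1] and the line breaks are skipped without any extra cleanup.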

The two exercises below are taken from R for Data Science, section 14.3.5.1 in the online version:

3. Describe, in words, what these expressions will match:

4. Construct regular expressions to match words that: