], provide code that identifies the majors that contain either “DATA” or “STATISTICS”

library(httr)

url  <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"
majors <- read.csv(paste0(url), header = TRUE)
grep(pattern = 'STATISTICS|DATA', majors$Major, value = TRUE, ignore.case = TRUE)

## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"     
## [3] "STATISTICS AND DECISION SCIENCE"

2 Write code that transforms the data below:

[1] “bell pepper” “bilberry” “blackberry” “blood orange”

[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”

[9] “elderberry” “lime” “lychee” “mulberry”

[13] “olive” “salal berry”

Into a format like this:

c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)

library(stringr)

startstr <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"

[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  

[9] "elderberry"   "lime"         "lychee"       "mulberry"    

[13] "olive"        "salal berry"'

dbl_quote = '"'
# Use function gregexpr to extract the substring = pattern
double_quotes_positions <- gregexpr(pattern =  dbl_quote, text = startstr)
# check double_quotes_positions[[1]][1]
double_quotes_positions[[1]][1]

## [1] 5

# print head of double_quotes positions
head(double_quotes_positions[[1]])

## [1]  5 17 20 29 35 46

# store all the double_quotes positions into the vector dq_pos
dq_pos <- vector()
i <- 1
while (!is.na(double_quotes_positions[[1]][i])){
                              dq_pos[i] <- double_quotes_positions[[1]][i]
                              i <- i+1
}

no_of_words <- length(dq_pos)/2
desired_output <- vector (length=no_of_words)
#print(desired_output)
# checking the length of desired_output 
length(desired_output)

## [1] 14

for (i in 1:no_of_words) {
desired_output[i] <- substring(startstr,double_quotes_positions[[1]][2*i-1]+1,double_quotes_positions[[1]][2*i]-1)
i <- i+1
}

# set the optional character to be \", \"
# Use writeLines to complete the deal
end_result <- paste0("c(\"", paste0(desired_output, collapse = "\", \""), "\")")


writeLines(end_result)

## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

The two exercises below are taken from R for Data Science, 14.3.5.1 in the on-line version:

3 Describe, in words, what these expressions will match:

(.)\1\1- This regular expression matches an expression containing the same three consecutive characters. It would match, for example, abbbc, addddc work but abc doesn’t.
"(.)(.)\\2\\1" - This is a string representing a regular expression that matches a pair of any characters followed by the reverse order of the same pair. So the first matching group is (.). Second matching group is (.).
Meaning 1st character and 2nd character of the string can be any characters. \2 matches the same text as most recently matched by 2nd matching group. Likewise, \1 matches the same text as the 1st matching group. Possible matches are “abba”, “0110”, “….”. The requirement is the strong has to be exactly of length 4. The matching string has to be enclosed in double quotes.
(..)\1- This regular expression matches any two characters followed by the same sequence of the same two characters. Possible matches are 1010, ababb, eabbabab. The matched expression doesn’t neccessarily have to be a string.
"(.).\\1.\\1" - It matches an expression that contains five characters where the first, third and fifth are the same and the second and fourth can be anything. Possible matches could be “abaaa”, “dedad”. Has to be enclosed in “” and anything of length not equal to 5 is going to fail. E.g. “a0a1af”
"(.)(.)(.).*\\3\\2\\1" - It matches an exprssion that is 6 or more carachters where the first three characters of the 6 characters are the same as the last three in reverse order. It could be of length 7+ or 6. If it’s of length 7+, the middele string of characters can be anything. Possible matches could be something like “ereere”, “1ere3ere”, and “ere4443ere”. Has to be a string that is enclosed within a pair of “.

4 Construct regular expressions to match words that:

Start and end with the same character. - ^(.).*\1$
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.) - ([a-zA-Z][a-zA-Z]).*\1
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.) - ([a-zA-Z]).*\1.*\1

Week 3 assignment - DATA 607

Dennis Pong

2/17/2020

1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”

2 Write code that transforms the data below:

3 Describe, in words, what these expressions will match:

4 Construct regular expressions to match words that: