# read.csv can read directly from a URL, so no extra package is needed
url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"
majors <- read.csv(url, header = TRUE)
grep(pattern = 'STATISTICS|DATA', majors$Major, value = TRUE, ignore.case = TRUE)
## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"
## [3] "STATISTICS AND DECISION SCIENCE"
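The same alternation pattern can be tried without downloading anything. The sketch below uses a hypothetical `sample_majors` vector (made up for illustration, not taken from the FiveThirtyEight file) to show how `|` and `ignore.case = TRUE` work together, and how `grepl` gives a logical vector instead of the matching values.

```r
# hypothetical sample, NOT taken from majors-list.csv
sample_majors <- c("Statistics and Decision Science",
                   "COMPUTER PROGRAMMING AND DATA PROCESSING",
                   "BOTANY",
                   "Applied Mathematics")

# "|" means "either alternative matches"; ignore.case makes it case-insensitive
hits <- grep("STATISTICS|DATA", sample_majors, value = TRUE, ignore.case = TRUE)
print(hits)

# grepl returns TRUE/FALSE per element, which is handy for subsetting
is_hit <- grepl("STATISTICS|DATA", sample_majors, ignore.case = TRUE)
print(is_hit)
```

With `value = TRUE`, `grep` returns the matching strings themselves; without it, it returns their indices.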
Transform the data below:
[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"
Into a format like this:
c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
library(stringr)
startstr <- '[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"'
dbl_quote <- '"'
# use gregexpr to find the position of every double quote in startstr
double_quotes_positions <- gregexpr(pattern = dbl_quote, text = startstr)
# position of the first double quote
double_quotes_positions[[1]][1]
## [1] 5
# print head of double_quotes positions
head(double_quotes_positions[[1]])
## [1] 5 17 20 29 35 46
# store all the double-quote positions in the vector dq_pos
dq_pos <- as.vector(double_quotes_positions[[1]])
# each word sits between one opening and one closing quote
no_of_words <- length(dq_pos) / 2
desired_output <- character(no_of_words)
#print(desired_output)
# checking the length of desired_output
length(desired_output)
## [1] 14
# extract the text between each opening quote (odd index) and closing quote (even index)
for (i in seq_len(no_of_words)) {
  desired_output[i] <- substring(startstr, double_quotes_positions[[1]][2 * i - 1] + 1, double_quotes_positions[[1]][2 * i] - 1)
}
# join the words with the separator \", \" and wrap the result in c(\"...\")
# use writeLines to print the result without escaped quotes
end_result <- paste0("c(\"", paste0(desired_output, collapse = "\", \""), "\")")
writeLines(end_result)
## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
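The position-based loop above can also be written as a single pattern match. The following is a minimal sketch of that alternative, not the approach used above. It assumes every item is wrapped in straight double quotes and contains no embedded quotes. The name `fruit_text` and its shortened contents are introduced here for illustration only.

```r
# illustrative input: a shortened stand-in for the printed vector
fruit_text <- '[1] "bell pepper" "bilberry" "blackberry"
[4] "blood orange" "lime"'

# match every run of non-quote characters enclosed in double quotes
quoted <- regmatches(fruit_text, gregexpr('"[^"]*"', fruit_text))[[1]]

# strip the surrounding quotes to recover the bare words
words <- gsub('"', "", quoted, fixed = TRUE)

# rebuild the c("...", "...") form and print it without escapes
result <- paste0('c("', paste(words, collapse = '", "'), '")')
writeLines(result)
```

Because the pattern consumes an opening and a closing quote together, the bracketed index markers such as `[4]` are skipped without any extra handling.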
The two exercises below are taken from R for Data Science, Section 14.3.5.1 in the online version:
(.)\1\1 - This regular expression matches the same character three times in a row. For example, it matches abbbc and addddc but not abc.
"(.)(.)\\2\\1" - This is an R string representing a regular expression that matches a pair of characters followed by the same pair in reverse order, e.g. “abba”. The first matching group is the first (.), and the second matching group is the second (.).
(..)\1 - This regular expression matches any two characters followed by the same two characters again. Possible matches are 1010, ababb, and eabbabab. The repeated block can occur anywhere in the text; it does not have to be the whole string.
"(.).\\1.\\1" - This string represents a regular expression that matches five characters in which the first, third, and fifth are the same and the second and fourth can be anything. Possible matches are “abaaa” and “dedad”. Because the pattern is not anchored, a longer string such as “a0a1af” also matches, since it contains “a0a1a”; a string shorter than five characters cannot match.
"(.)(.)(.).*\\3\\2\\1" - This string represents a regular expression that matches three characters, followed by zero or more characters of any kind, followed by the same three characters in reverse order, so a match is at least six characters long. Possible matches are “ereere”, “1ere3ere”, and “ere4443ere”.
Regular expressions that match words that:
Start and end with the same character. - ^(.).*\1$ (this form needs at least two characters, so a one-letter word does not match)
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.) - ([a-zA-Z][a-zA-Z]).*\1
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.) - ([a-zA-Z]).*\1.*\1
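A quick way to check patterns like these is to run them through `grepl` on a few test strings. The sketch below uses made-up test vectors (`x` and `words` are illustrative names, not part of the exercises). Inside an R string each backslash is doubled, so the regular expression `\1` is written `"\\1"`.

```r
# made-up test strings for the first exercise
x <- c("abbbc", "abc", "abba", "1010", "abaca", "a0a1af", "abccba")

triple   <- grepl("(.)\\1\\1", x)             # same character three times in a row
mirrored <- grepl("(.)(.)\\2\\1", x)          # a pair followed by its reverse
doubled  <- grepl("(..)\\1", x)               # a two-character block repeated
odd_same <- grepl("(.).\\1.\\1", x)           # 1st, 3rd and 5th characters equal
reversed <- grepl("(.)(.)(.).*\\3\\2\\1", x)  # three characters, later reversed

# show which test strings each pattern accepts
print(x[triple])
print(x[mirrored])
print(x[doubled])
print(x[odd_same])
print(x[reversed])

# made-up words for the second exercise
words <- c("church", "eleven", "dad", "apple", "a")

same_ends     <- grepl("^(.).*\\1$", words)               # start and end alike
repeated_pair <- grepl("([a-zA-Z][a-zA-Z]).*\\1", words)  # a letter pair occurs twice
three_times   <- grepl("([a-zA-Z]).*\\1.*\\1", words)     # one letter in three places

print(words[same_ends])
print(words[repeated_pair])
print(words[three_times])
```

Note that the one-letter word `"a"` is rejected by `^(.).*\1$`, because the backreference needs a second character to match.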