majors <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv", header = TRUE, sep = ",")
#From this dataframe, select the column listing majors and store it as a matrix
major_col <- as.matrix(majors[,2, drop=FALSE])
#Find the indices of majors that contain either "DATA" or "STATISTICS"
sapply("DATA", function(y) grep(y,major_col))
## DATA
## 52
sapply("STATISTICS", function(y) grep(y,major_col))
## STATISTICS
## [1,] 44
## [2,] 59
#Store and then display the majors of interest
matching_majors <- c(major_col[52], major_col[44], major_col[59])
matching_majors
## [1] "COMPUTER PROGRAMMING AND DATA PROCESSING"
## [2] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [3] "STATISTICS AND DECISION SCIENCE"
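The lookup above copies the row numbers returned by `grep()` by hand. As a sketch of an alternative, a single alternation pattern can select the matching majors directly, so the result stays correct if the rows ever move. The vector `sample_majors` is a small made-up stand-in for the majors column, used only so the example runs without downloading the file.

```r
# Hypothetical stand-in for the majors column of the data frame
sample_majors <- c("GENERAL AGRICULTURE",
                   "COMPUTER PROGRAMMING AND DATA PROCESSING",
                   "STATISTICS AND DECISION SCIENCE",
                   "ZOOLOGY")

# "DATA|STATISTICS" matches any string containing either word
matching_idx <- grep("DATA|STATISTICS", sample_majors)

# value = TRUE returns the matching strings instead of their positions
matching_sample <- grep("DATA|STATISTICS", sample_majors, value = TRUE)
matching_sample
```

Applied to `major_col`, the same call would return the matches in row order rather than in the order they were typed by hand.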
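I interpreted this question as: given the printed form of a character vector, stored as one long string, convert it into a proper character vector whose elements are the individual strings (the form you would type as a comma-separated `c(...)` call).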
produce <- c('[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"')
produce #verify what the input looks like
## [1] "[1] \"bell pepper\" \"bilberry\" \"blackberry\" \"blood orange\"\n [5] \"blueberry\" \"cantaloupe\" \"chili pepper\" \"cloudberry\" \n [9] \"elderberry\" \"lime\" \"lychee\" \"mulberry\" \n[13] \"olive\" \"salal berry\""
Based on this input, we see that the string needs heavy processing: the index markers, newlines, and escaped quotes must be stripped before the string can be split into individual elements.
library(stringr) #provides str_replace_all() and str_split()
#Handle special characters and digits
produce <- str_replace_all(produce, "[\\[\\]]", "") #remove square brackets
produce <- str_replace_all(produce, "[[:digit:]]", "") #remove digits
produce <- str_replace_all(produce, "\n", "") #remove newlines
produce <- str_replace_all(produce, '"', "'") #replace each " with '
#Handle white space
produce <- trimws(produce) #remove leading / trailing whitespace
produce <- str_replace_all(produce, "\\s+", " ") #compress whitespace
produce <- str_replace_all(produce, "' '", "','") #turn the ' ' between items into ','
#Remove excess characters and properly split and then remerge the vector
produce <- str_replace_all(produce, "'", "") #remove remaining 's
produce <- str_split(produce, pattern=",") #convert vector to list at each ,
produce <- unlist(produce) #convert list back to vector
#Verify that our output matches what was desired
desired_output <- c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
produce
## [1] "bell pepper" "bilberry" "blackberry" "blood orange" "blueberry"
## [6] "cantaloupe" "chili pepper" "cloudberry" "elderberry" "lime"
## [11] "lychee" "mulberry" "olive" "salal berry"
produce == desired_output
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
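The cleanup above removes unwanted characters one step at a time. A shorter route, sketched here in base R as an assumption rather than as the intended solution, is to extract everything that sits between double quotes. The string `printed` is a shortened, made-up copy of the input.

```r
# A shortened copy of the printed vector, stored as one string
printed <- '[1] "bell pepper" "bilberry"\n[3] "lime"'

# Find every run of characters enclosed in double quotes
quoted <- regmatches(printed, gregexpr('"[^"]+"', printed))[[1]]

# Drop the surrounding quotes to leave the bare item names
items <- gsub('"', "", quoted, fixed = TRUE)
items
```

Because the index markers and newlines are never inside quotes, they are skipped without needing their own replacement steps.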
The two exercises below are taken from R for Data Science, 14.3.5.1 in the on-line version:
(.)\1\1
This regular expression matches any character repeated three times in a row (for example, "aaa"). To use it in R it must be written as a quoted string with the backslashes doubled: "(.)\\1\\1". With a single backslash inside a string, R reads "\1" as an escape for the control character with code 1 rather than as a backreference, and an unquoted pattern does not parse at all.
#The pattern is passed as a quoted string with doubled backslashes
str_view(fruit, "(.)\\1\\1", match = TRUE)
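The difference between the single- and double-backslash forms of a backreference can be checked directly in base R. This is a small sketch; the variable names are illustrative only.

```r
# "\\1" is two characters: a backslash and the digit 1 (a regex backreference)
backref <- "\\1"
nchar(backref)        # 2

# "\1" is a single character: the control character with code 1
control <- "\1"
nchar(control)        # 1
utf8ToInt(control)    # 1

# With doubled backslashes the pattern finds a tripled character
tripled <- grepl("(.)\\1\\1", c("aaa", "aab"), perl = TRUE)
tripled
```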
"(.)(.)\\2\\1"
This expression matches two characters followed by the same two characters in reverse order: 1st character - 2nd character - 2nd character - 1st character. In the fruit examples it matches bell pepper and chili pepper. Note the order e-p-p-e.
str_view(fruit, "(.)(.)\\2\\1", match = TRUE)
(..)\1
This regular expression matches any pair of characters that is immediately repeated (e.g. "anan" in banana). To use it in R it must be written as a quoted string with the backslash doubled: "(..)\\1".
#The pattern is passed as a quoted string with a doubled backslash
str_view(fruit, "(..)\\1", match = TRUE)
"(.).\\1.\\1"
This expression matches five characters in which the 1st, 3rd, and 5th are the same. The 2nd and 4th can be any characters and need not match each other. For example: 1st character - any character - 1st character - any character - 1st character. In the fruit examples it matches banana and papaya. Note the order a.a.a
str_view(fruit, "(.).\\1.\\1", match = TRUE)
"(.)(.)(.).*\\3\\2\\1"
This expression matches three characters, then any number of characters (including none), then the same three characters in reverse order. For example: 1st character - 2nd character - 3rd character - any sequence (…) - 3rd character - 2nd character - 1st character. The fruit examples don't contain a match, but a made-up string that does match is: 'hititontheyabadabatih'.
madeupword <- "hititontheyabadabatih"
str_view(madeupword, "(.)(.)(.).*\\3\\2\\1", match = TRUE)
Start and end with the same character: "^(.).*\\1$"
string1 <- "dead"
string2 <- "defeated"
str_view(string1, "^(.).*\\1$")
str_view(string2, "^(.).*\\1$")
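One edge case worth noting: the pattern above needs at least two characters, so a one-character string such as "a" is not matched even though it trivially starts and ends with the same character. A possible variant, offered here as an assumption rather than as the exercise's expected answer, makes the trailing part optional.

```r
# Variant: the ".*\\1" tail is optional, so a single character also matches
same_ends <- "^(.)(.*\\1)?$"

edge_words <- c("a", "dead", "defeated", "deal")
edge_hits <- grepl(same_ends, edge_words, perl = TRUE)
edge_hits
```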
Contain a repeated pair of letters (e.g. "church" contains "ch" repeated twice.): "(..).*\\1"
string3 <- "church"
string4 <- "banana"
str_view(string3, "(..).*\\1")
str_view(string4, "(..).*\\1")
Contain one letter repeated in at least three places (e.g. "eleven" contains three "e"s.): "(.).*\\1.*\\1"
string5 <- "eleven"
string6 <- "bee-eater"
str_view(string5, "(.).*\\1.*\\1", match = TRUE)
str_view(string6, "(.).*\\1.*\\1", match = TRUE)
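As a cross-check, the three patterns can be run against all six test strings at once. The sketch below uses base R's `grepl()` with `perl = TRUE` instead of `str_view()`, and the names in `patterns` are illustrative labels, not part of the exercise.

```r
patterns <- c(same_ends     = "^(.).*\\1$",
              repeated_pair = "(..).*\\1",
              three_times   = "(.).*\\1.*\\1")
words <- c("dead", "defeated", "church", "banana", "eleven", "bee-eater")

# Rows are words, columns are patterns; TRUE means the word matches
results <- sapply(patterns, function(p) grepl(p, words, perl = TRUE))
rownames(results) <- words
results
```

Each word is matched by the pattern it was chosen to illustrate, and the table also shows overlaps: "defeated" and "banana" each contain a character repeated three times.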