This week’s assignment comprises the following tasks:
Provide code that identifies college majors containing “DATA” or “STATISTICS”
Provide code that transforms a set of data
Describe, in words, what the specified expressions will match
Construct regular expressions to match words
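Following is the code to identify college majors by substring. As run here, the pattern matches “DATA” or “SCIENCES”; to match “DATA” or “STATISTICS”, as the task states, the alternation would be "DATA|STATISTICS".
# find majors containing specific strings
# (assumes stringr is loaded and five38_df holds the majors data)
# both sides are lower-cased so the comparison ignores case; the result is title-cased for display
str_to_title(str_subset(str_to_lower(five38_df$Major), str_to_lower("DATA|SCIENCES")))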
## [1] "Animal Sciences"
## [2] "Biochemical Sciences"
## [3] "Computer Programming And Data Processing"
## [4] "Information Sciences"
## [5] "Nutrition Sciences"
## [6] "Communication Disorders Sciences And Services"
## [7] "Pharmacy Pharmaceutical Sciences And Administration"
## [8] "Family And Consumer Sciences"
## [9] "Transportation Sciences And Technologies"
## [10] "Physical Sciences"
## [11] "Atmospheric Sciences And Meteorology"
## [12] "Geosciences"
## [13] "Interdisciplinary Social Sciences"
## [14] "General Social Sciences"
## [15] "Miscellaneous Social Sciences"
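Lower-casing both the data and the pattern is one way to ignore case. Another option is a case-insensitive pattern. The following is a minimal sketch, assuming stringr is installed; `majors` is a hypothetical stand-in for `five38_df$Major`, not the real data, and the pattern uses "STATISTICS" as in the task statement.

```r
library(stringr)

# Hypothetical stand-in for five38_df$Major (not the real dataset)
majors <- c("COMPUTER PROGRAMMING AND DATA PROCESSING",
            "STATISTICS AND DECISION SCIENCE",
            "GENERAL ENGINEERING")

# regex(ignore_case = TRUE) matches regardless of case,
# so neither the data nor the pattern needs to be lower-cased first
hits <- str_subset(majors, regex("data|statistics", ignore_case = TRUE))

# Title-case only for display
str_to_title(hits)
```

With this approach the original strings are left untouched until the final display step.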
Transform this vector of data from one form into another form:
# transform a set of data
(task_var <- c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry"))
## [1] "bell pepper" "bilberry" "blackberry" "blood orange" "blueberry"
## [6] "cantaloupe" "chili pepper" "cloudberry" "elderberry" "lime"
## [11] "lychee" "mulberry" "olive" "salal berry"
(new_var <- glue::glue('c("{fruitlist}")',
fruitlist = glue::glue_collapse(task_var, sep = '", "')))
## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
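The same transformation can be sketched in base R without glue, which also makes it easy to check that the resulting text really is valid R code. This is only an illustration; `fruits`, `quoted` and `as_code` are hypothetical names, and the vector is shortened.

```r
# Shortened, hypothetical version of task_var
fruits <- c("bell pepper", "bilberry", "lime")

# Wrap each element in double quotes and join the pieces with ", "
quoted <- paste0('"', fruits, '"', collapse = ", ")

# Wrap the joined text in c( ... ) to get the code form of the vector
as_code <- paste0("c(", quoted, ")")
cat(as_code, "\n")

# Round trip: parsing and evaluating the text should give back the vector
identical(eval(parse(text = as_code)), fruits)
```

The round-trip check at the end is a quick way to confirm that the quoting and separators came out right.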
# patterns
# (.)\1\1
# "(.)(.)\\2\\1"
# (..)\1
# "(.).\\1.\\1"
# "(.)(.)(.).*\\3\\2\\1"
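# 1. (.)\1\1 captures any single character as group 1, then requires the same character twice more, i.e. three identical characters in a row. Expected result: aaa, bbb
# Typed into an R string with single backslashes, "\1" is an octal escape for the control character with code 1, not a backreference, so this call finds no matches
str_view(c("aaa","abc","bbb"),"(.)\1\1") # no matches
# Not a typo in R4DS: the book shows this pattern as a bare regex; inside an R string each backslash must be doubled
str_view(c("aaa","abc","bbb"),"(.)\\1\\1") # matches aaa and bbb
# Code from a Slack post. As an R string, "(.)\1\1" means any character followed by two control characters (code 1), so only "data\1\1" matches
test <- c("777", "data\1\1", "anna", "2002", '"elle"') # character vector rather than a list
str_view(test, "(.)\1\1", match = TRUE)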
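The difference between one and two backslashes can be inspected directly in base R. This is a small sketch with hypothetical variable names; it only looks at what each string literal contains and then runs the doubled-backslash pattern through `grepl()` with `perl = TRUE`.

```r
one_backslash   <- "\1"   # octal escape: a single control character
two_backslashes <- "\\1"  # a literal backslash followed by the digit 1

nchar(one_backslash)         # 1
utf8ToInt(one_backslash)     # 1, the code point of that control character
nchar(two_backslashes)       # 2
writeLines(two_backslashes)  # prints \1, which is what the regex engine receives

# Only the doubled form reaches the regex engine as a backreference
has_triple <- grepl("(.)\\1\\1", c("aaa", "abc", "bbb"), perl = TRUE)
has_triple                   # TRUE FALSE TRUE
```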
# 2. The pattern below captures two characters as groups 1 and 2, then requires the same two characters in reverse order. Expected result: abba, baab
str_view(c("abba","baba","baab","saab"),"(.)(.)\\2\\1")
# 3. (..)\1 captures two characters as group 1, then requires the same pair again, i.e. a pair of characters repeated once. Expected result: aaaa, abab
# With single backslashes in an R string, "\1" is a control character rather than a backreference, so nothing matches
str_view(c("aaaa","abab","bbbc"),"(..)\1") # no matches
# With doubled backslashes the backreference works
str_view(c("aaaa","abab","bbbc"),"(..)\\1") # aaaa, abab
# 4. The pattern below captures a character as group 1, followed by any character, the group 1 character again, any character, and the group 1 character a third time. Expected result: abaca
str_view(c("abaca","bcdbb"),"(.).\\1.\\1") #abaca
# 5. The pattern below captures three single characters as groups 1, 2 and 3, followed by zero or more characters, then the same three characters in reverse order. Expected result: abccba, abcdcba
str_view(c("abccba","abcdcba","abc123"),"(.)(.)(.).*\\3\\2\\1") #abccba, abcdcba
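Because `str_view()` output is visual, the descriptions of the five patterns can also be checked with `str_detect()`, which returns one logical value per input string. This is a sketch assuming stringr is installed; the variable names `p1` to `p5` are hypothetical, and every pattern is written with doubled backslashes.

```r
library(stringr)

# 1. three identical characters in a row
p1 <- str_detect(c("aaa", "abc", "bbb"), "(.)\\1\\1")
# 2. two characters followed by the same two in reverse order
p2 <- str_detect(c("abba", "baba", "baab", "saab"), "(.)(.)\\2\\1")
# 3. a pair of characters repeated once
p3 <- str_detect(c("aaaa", "abab", "bbbc"), "(..)\\1")
# 4. the same character in positions 1, 3 and 5 of a five-character run
p4 <- str_detect(c("abaca", "bcdbb"), "(.).\\1.\\1")
# 5. three characters, anything, then the same three reversed
p5 <- str_detect(c("abccba", "abcdcba", "abc123"), "(.)(.)(.).*\\3\\2\\1")

p1  # TRUE FALSE TRUE
p2  # TRUE FALSE TRUE FALSE
p3  # TRUE TRUE FALSE
p4  # TRUE FALSE
p5  # TRUE TRUE FALSE
```

Note that pattern 5 also accepts "abcdcba", since `.*` may consume the middle "d".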
## Task 4: Construct regular expressions to match words
#
str_subset(five38_df$Major, "^(.).*\\1$") # start & end same char; .* may be empty, so a separate two-character branch is not needed
## [1] "STUDIO ARTS"
## [2] "ENVIRONMENTAL SCIENCE"
## [3] "GENERAL ENGINEERING"
## [4] "ENGINEERING MECHANICS PHYSICS AND SCIENCE"
## [5] "GEOLOGICAL AND GEOPHYSICAL ENGINEERING"
## [6] "ENGLISH LANGUAGE AND LITERATURE"
## [7] "COMPOSITION AND RHETORIC"
str_subset("church", "([A-Za-z][A-Za-z]).*\\1") # repeated pair of letters
## [1] "church"
str_subset("eleven", "([A-Za-z]).*\\1.*\\1") # one letter repeated in 3 places
## [1] "eleven"
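Testing each expression against a single word shows that it can match, but not that it rejects the right words. The sketch below runs the three kinds of pattern over one small vector. It assumes stringr is installed; `words_demo` is a hypothetical word list, and the first pattern, `^(.).*\\1$`, accepts the same strings as the start-and-end pattern above.

```r
library(stringr)

# Hypothetical word list chosen so that each pattern keeps some words and rejects others
words_demo <- c("church", "eleven", "stats", "regex", "banana")

# Starts and ends with the same character
same_ends <- str_subset(words_demo, "^(.).*\\1$")

# Contains a repeated pair of letters
repeated_pair <- str_subset(words_demo, "([A-Za-z][A-Za-z]).*\\1")

# Contains one letter in at least three places
three_times <- str_subset(words_demo, "([A-Za-z]).*\\1.*\\1")

same_ends      # "stats"
repeated_pair  # "church" "banana"
three_times    # "eleven" "banana"
```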
Regular expressions are useful because strings usually contain unstructured or semi-structured data, and regexps are a concise language for describing patterns in strings.