Overview: The assignment for this week is related to Regular Expressions.

Load the Tidyverse packages.

library(tidyverse)

Read majors data from CSV

theUrl <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"
majors <- read.csv(file=theUrl, header=TRUE, sep=",", stringsAsFactors = FALSE)
head(majors)
##   FOD1P                                 Major
## 1  1100                   GENERAL AGRICULTURE
## 2  1101 AGRICULTURE PRODUCTION AND MANAGEMENT
## 3  1102                AGRICULTURAL ECONOMICS
## 4  1103                       ANIMAL SCIENCES
## 5  1104                          FOOD SCIENCE
## 6  1105            PLANT SCIENCE AND AGRONOMY
##                    Major_Category
## 1 Agriculture & Natural Resources
## 2 Agriculture & Natural Resources
## 3 Agriculture & Natural Resources
## 4 Agriculture & Natural Resources
## 5 Agriculture & Natural Resources
## 6 Agriculture & Natural Resources

Identifies the majors that contain either “DATA” or “STATISTICS”

str_view(majors$Major, "DATA|STATISTICS", match=TRUE)

Transform data from one format to another format

a <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  
[9] "elderberry"   "lime"         "lychee"       "mulberry"    
[13] "olive"        "salal berry"'
aa <- str_c(unlist(str_extract_all(a, "\"([a-z]+)\\s*([a-z]+)*\"")), collapse=", ")
cat("c(", aa, ")")
## c( "bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry" )

#3 Describe, in words, what these expressions will match:

A. (.)\1\1 - (.) is any character except newline. The backreference \1 references the first capturing group (.). The second \1 also references the first capturing group(.)

regex_expr = '(.)\\1\\1'
test_cases = c('aaa','"aaa"', "aba")
str_view_all(test_cases, regex_expr)

B. “(.)(.)\2\1” - It is a string that defines a regular expression. 2 x (.) are any 2 characters except newline. \2 refers to the second (.) while \1 referes to the first (.) from the left.

regex_expr = '"(.)(.)\\2\\1"'
test_cases = c('anna','"anna"', 'Maria','""','"able was I ere I saw hannah"')
str_view_all(test_cases, regex_expr)

C. (..)\1 - (..) is any 2 characters except newline. The backreference \1 references the first capturing group (..)

regex_expr = '(..)\\1'
test_cases = c('abab', 'anna','"anna"', 'Maria','""','"able was I ere I saw hannah"')
str_view_all(test_cases, regex_expr)

D. “(.).\1.\1”- It is a string that defines a regular expression. There are 5 (any) characters altogether. Both \1 are refering to the first capturing group (.). -

regex_expr = "(.).\\1.\\1"
test_cases = c('abaca', 'abab', 'anna','"anna"', 'Maria','""','"able was I ere I saw hannah"')
str_view_all(test_cases, regex_expr)

E. "(.)(.)(.).\3\2\1" - 3 capturing group (.) of any character, 1 any character, represents 0 or more of any character , \3 references the third (.), \2 references the second (.) and \1 references the first (.).

regex_expr = "(.)(.)(.).*\\3\\2\\1"
test_cases = c('abcdefgcba', 'abab', 'anna','"anna"', 'Maria','""','"able was I ere I saw hannah"')
str_view_all(test_cases, regex_expr)

#4 Construct regular expressions to match words that:

A. Start and end with the same character. - "^(.).*\1$"

regex_expr = "^(.).*\\1$"
test_cases = c('abab', 'anna','"anna"', 'Maria','""','"able was I ere I saw hannah"')
str_view_all(test_cases, regex_expr)

B. Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.) - "(..).*\1"

regex_expr = "(..).*\\1"
test_cases = c('abab', 'church','"anna"', 'Maria','""','"able was I ere I saw hannah"')
str_view_all(test_cases, regex_expr)

C. Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)- "(.).\1.\1.*"

regex_expr = "(.).*\\1.*\\1.*"
test_cases = c('ababab', 'church','eleven','"anna"', 'Maria','""','"able was I ere I saw hannah"')
str_view_all(test_cases, regex_expr)