Regular Expression

#1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset (https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/), provide code that identifies the majors that contain either “DATA” or “STATISTICS”.

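library(dplyr)  # provides %>% and filter()

# Load the list of majors from the fivethirtyeight repository
coll_dt <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv")

# Keep the rows whose Major contains "DATA" or "STATISTICS"
data_stat <- coll_dt %>%
  filter(grepl("DATA|STATISTICS", Major))
data_stat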

##   FOD1P                                         Major          Major_Category
## 1  6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS                Business
## 2  2101      COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
## 3  3702               STATISTICS AND DECISION SCIENCE Computers & Mathematics
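
The same filter can be sketched in base R without downloading the file. The `majors` vector below is a small hand-typed sample (an assumption for illustration, not the full dataset), and `grep(..., value = TRUE)` is swapped in for `filter()`.

```r
# Hypothetical sample of the Major column; the real file has many more rows
majors <- c("MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
            "COMPUTER PROGRAMMING AND DATA PROCESSING",
            "STATISTICS AND DECISION SCIENCE",
            "GENERAL AGRICULTURE")

# value = TRUE returns the matching strings instead of their positions
hits <- grep("DATA|STATISTICS", majors, value = TRUE)
print(hits)
```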

#2 Write code that transforms the data below

text <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"

[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  

[9] "elderberry"   "lime"         "lychee"       "mulberry"    

[13] "olive"        "salal berry"'

library(stringr)

# Extract each quoted item (lowercase letters and spaces), then strip the quotes
new_text <- str_remove_all(unlist(str_extract_all(text, "\"[a-z ]+\"")), "\"")
new_text

##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"
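
If the goal is a literal `c(...)` expression rather than a character vector (an assumption; the prompt above does not state the target format), the vector can be pasted back together. This is a minimal sketch on a shortened sample of `new_text`; `as_call` is a name introduced here.

```r
# Shortened sample of the extracted vector, for illustration only
new_text <- c("bell pepper", "bilberry", "blackberry")

# Wrap each item in quotes, join with ", ", and surround with c( ... )
as_call <- paste0("c(", paste0('"', new_text, '"', collapse = ", "), ")")
cat(as_call, "\n")
```

Evaluating the resulting string with `eval(parse(text = as_call))` gives back the original vector.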

#3 Describe, in words, what these expressions will match

“(.)\1\1”

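As written in an R string, “\1” is an octal escape for the control character with code 1, not a backreference, so “(.)\1\1” matches any character followed by two of those control characters. The backreference form is “(.)\\1\\1”: “\\1” matches the exact text captured by the first group (the first parenthesized expression), so the pattern matches the same character three times in a row, e.g. “aaa”.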
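
A quick check of “(.)\\1\\1” in base R, using `grepl()` with `perl = TRUE` in place of stringr. The test strings in `x` are made up for illustration.

```r
x <- c("aaa", "baaad", "aab", "abab")

# TRUE only where some character occurs three times in a row
hits <- grepl("(.)\\1\\1", x, perl = TRUE)
print(setNames(hits, x))
```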

“(.)(.)\\2\\1”

“\\1” and “\\2” refer back to the text captured by the first and second groups (the first and second parenthesized expressions). “(.)(.)\\2\\1” matches any two characters followed by the same two characters in reverse order, e.g. “abba” or the “eppe” in “pepper”.
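
To see which substring “(.)(.)\\2\\1” picks out, the sketch below uses base R `regexpr()` and `regmatches()`; the words in `x` are illustrative only.

```r
x <- c("abba", "pepper", "abab")

# regmatches() returns the first match per string and drops non-matching strings
m <- regmatches(x, regexpr("(.)(.)\\2\\1", x, perl = TRUE))
print(m)
```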

“(..)\1”

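“(..)\1” is not a syntax error in R: “\1” is parsed as the control character with code 1, so the pattern matches any two characters followed by that control character. “(..)\\1” matches any two characters immediately repeated, e.g. the “anan” in “banana”.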
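
The difference between one and two backslashes can be inspected directly. This is a small base R sketch; `one_slash` and `two_slash` are names introduced here for illustration.

```r
one_slash <- "(..)\1"    # "\1" is an octal escape: a single control character
two_slash <- "(..)\\1"   # "\\1" is a backslash plus "1": a regex backreference

n_one <- nchar(one_slash)                       # counts ( . . ) and the control character
n_two <- nchar(two_slash)                       # counts ( . . ) \ 1
ctrl_code <- utf8ToInt(substr(one_slash, 5, 5)) # code point of the fifth character

# The backreference finds a doubled pair; the control-character pattern does not
pair <- regmatches("banana", regexpr(two_slash, "banana", perl = TRUE))
plain <- grepl(one_slash, "banana", perl = TRUE)
print(list(n_one, n_two, ctrl_code, pair, plain))
```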

‘(.).\\1.\\1’

“(.).\\1.\\1” matches five characters: a captured character, any character, the captured character again, any character, and the captured character a third time, e.g. “abaca” or the “anana” in “banana”. The two characters matched by the unescaped . need not be the same.
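
A short base R sketch of what “(.).\\1.\\1” extracts; the words in `x` are chosen for illustration.

```r
x <- c("abaca", "banana", "eleven", "abcde")

# Each match is five characters long: c ? c ? c, where c is the captured character
m <- regmatches(x, regexpr("(.).\\1.\\1", x, perl = TRUE))
print(m)
```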

"(.)(.)(.).*\\3\\2\\1"

(.)(.)(.) defines three capture groups: group 1, group 2 and group 3. "(.)(.)(.).*\\3\\2\\1" matches three characters, followed by zero or more characters of any kind (.*), followed by the same three characters in reverse order, e.g. "abccba" or "abc123cba".
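
The role of `.*` in this pattern can be checked with a few made-up strings in base R:

```r
x <- c("abccba", "abc123cba", "abcabc")

# .* may be empty ("abccba") or span several characters ("abc123cba");
# "abcabc" fails because the trailing three characters are not reversed
hits <- grepl("(.)(.)(.).*\\3\\2\\1", x, perl = TRUE)
print(setNames(hits, x))
```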

#4 Construct regular expressions to match words that:

Start and end with the same character

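fruit <- c('coconut', 'cucumber', 'jujube', 'papaya', 'salal berry', 'eleven')
# ^ and $ tie the captured first character to the end of the string;
# the optional group also lets a one-character word match.
# None of the six sample words above starts and ends with the same character.
str_view(fruit, "^(.)(.*\\1)?$")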

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

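# .* allows other characters between the two occurrences of the pair, as in "church"
str_view(fruit, "([A-Za-z][A-Za-z]).*\\1")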

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

# .* between the backreferences lets the three occurrences sit anywhere in the word
str_view(fruit, "([A-Za-z]).*\\1.*\\1")
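
As a cross-check on the three tasks in #4, here is a base R sketch with patterns that allow arbitrary text between the repetitions. The `words` vector is a hypothetical test set (it adds "church" and "area", which are not in `fruit`), and `grepl()` stands in for `str_view()`.

```r
words <- c("church", "eleven", "area", "coconut", "banana")

# Start and end with the same character
same_ends <- words[grepl("^(.)(.*\\1)?$", words, perl = TRUE)]

# Contain a repeated pair of letters, not necessarily adjacent
rep_pair <- words[grepl("([a-z]{2}).*\\1", words, perl = TRUE)]

# Contain one letter in at least three places
three_times <- words[grepl("([a-z]).*\\1.*\\1", words, perl = TRUE)]

print(list(same_ends = same_ends, rep_pair = rep_pair, three_times = three_times))
```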