Assignment 3: Regex

library(tidyverse)

Exercise 1

Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”.

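# Read the majors list from the FiveThirtyEight GitHub repository
col_majors_df <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv")

# The second column holds the major names; keep those containing DATA or STATISTICS
str_subset(col_majors_df[[2]], "DATA|STATISTICS")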

## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"     
## [3] "STATISTICS AND DECISION SCIENCE"
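
To make the matching step easier to inspect without downloading the file, the sketch below runs the same alternation on a small hand-typed vector. The vector `majors_sample` is introduced here for illustration, and its last entry, "DATABASE DESIGN", is hypothetical and not taken from the dataset. It is there only to show that a bare `DATA` also matches inside a longer word, which `\\b` word boundaries prevent.

```r
library(stringr)

# Small sample: three majors from the output above, one non-match,
# and one HYPOTHETICAL entry ("DATABASE DESIGN") that is not in the dataset
majors_sample <- c(
  "ACCOUNTING",
  "COMPUTER PROGRAMMING AND DATA PROCESSING",
  "STATISTICS AND DECISION SCIENCE",
  "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
  "DATABASE DESIGN"
)

# Plain alternation: matches DATA or STATISTICS anywhere in the string
loose_match <- str_detect(majors_sample, "DATA|STATISTICS")

# Whole-word alternation: DATA inside DATABASE no longer counts
word_match <- str_detect(majors_sample, "\\b(DATA|STATISTICS)\\b")

majors_sample[loose_match]
majors_sample[word_match]
```

All three majors in the output above contain DATA or STATISTICS as a whole word, so either pattern returns the same result on the real list.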

Exercise 2

Write code that transforms the data below:
[1] “bell pepper” “bilberry” “blackberry” “blood orange”
[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”
[9] “elderberry” “lime” “lychee” “mulberry”
[13] “olive” “salal berry”

Into a format like this:

c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)

fruit_str <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  
[9] "elderberry"   "lime"         "lychee"       "mulberry"    
[13] "olive"        "salal berry"'

# Extract every double-quoted item from the printed vector
fruit_list <- str_extract_all(string = fruit_str, pattern = '".*?"')
# Join the quoted items into one comma-separated string
fruit_str <- str_c(fruit_list[[1]], collapse = ', ')

# Wrap the joined items in c( ... )
str_glue('c({fruit_str})')

## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
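
A possible cross-check, assuming the goal is a vector that R can parse back: strip the quotes, keep the items as a real character vector, and let base R's `dput()` print the `c(...)` form. The names `fruit_raw` and `fruits` are introduced here for illustration only.

```r
library(stringr)

fruit_raw <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"
[9] "elderberry"   "lime"         "lychee"       "mulberry"
[13] "olive"        "salal berry"'

# Pull out each quoted item, then drop the surrounding quotes
fruits <- str_extract_all(fruit_raw, '"[^"]+"')[[1]]
fruits <- str_remove_all(fruits, '"')

length(fruits)  # number of items extracted
dput(fruits)    # prints the vector as a c(...) call
```

Keeping the items as a character vector, rather than as one long string, means the result can be used directly in later code.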

Exercise 3

Describe, in words, what these expressions will match:

(.)\1\1
(.) creates a capture group that matches any single character.
\1 matches the same text that the capture group matched.
The second \1 matches that captured text once more.
Thus, this regular expression matches any character that appears three times in a row.
example: aaap

string_exam <- c('aaap', 'anna','paneaean', 'asafa', 'rdghjgdr')

str_view(string_exam, "(.)\\1\\1")

(.)(.)\2\1
(.) creates capture group 1, which matches any character.
(.) creates capture group 2, which matches any character.
\2 matches the text captured by group 2.
\1 matches the text captured by group 1.

Thus, this regular expression matches a pair of characters immediately followed by the same pair in reverse order.
example: anna, goog

str_view(string_exam, "(.)(.)\\2\\1")
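
To see what each capture group actually holds, one option is `str_match()`, which returns the full match followed by one column per group. The sketch below is only an illustration on two of the test strings; the variable name `m` is introduced here.

```r
library(stringr)

# Columns of the result: full match, capture group 1, capture group 2
m <- str_match(c("anna", "aaap"), "(.)(.)\\2\\1")
m
# Row 1 ("anna"): full match "anna", group 1 "a", group 2 "n"
# Row 2 ("aaap"): all NA, because the pattern does not match
```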

(..)\1
(..) creates a capture group that matches any two characters.
\1 matches the same text that the capture group matched.
Thus, this regular expression matches any two characters that are immediately repeated.
example: eaea

str_view(string_exam, "(..)\\1")

(.).\1.\1
(.) creates a capture group that matches any character.
. matches any character.
\1 matches the text captured by group 1.
. matches any character.
\1 matches the text captured by group 1.

Thus, this regular expression matches five consecutive characters in which the 1st, 3rd and 5th are the same; the 2nd and 4th can be anything.
example: asafa

str_view(string_exam, "(.).\\1.\\1")

(.)(.)(.).*\3\2\1
(.) creates capture group 1, which matches any character.
(.) creates capture group 2, which matches any character.
(.) creates capture group 3, which matches any character.
.* matches zero or more of any character.
\3 matches the text captured by group 3.
\2 matches the text captured by group 2.
\1 matches the text captured by group 1.

Thus, this regular expression matches any three characters, followed by zero or more characters, followed by the same three characters in reverse order.
example: rdghjgdr

str_view(string_exam, "(.)(.)(.).*\\3\\2\\1")
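
As a compact summary of the five patterns, the sketch below tests each one against every string in the test vector with `str_detect()` and collects the results in a logical matrix. The names `patterns` and `hits` are introduced here for illustration.

```r
library(stringr)

string_exam <- c("aaap", "anna", "paneaean", "asafa", "rdghjgdr")

patterns <- c(
  "(.)\\1\\1",
  "(.)(.)\\2\\1",
  "(..)\\1",
  "(.).\\1.\\1",
  "(.)(.)(.).*\\3\\2\\1"
)

# One column per pattern, one row per test string
hits <- sapply(patterns, function(p) str_detect(string_exam, p))
rownames(hits) <- string_exam
hits
```

With this test vector each pattern matches exactly one string, in order: aaap, anna, paneaean, asafa, rdghjgdr.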

Exercise 4

Construct regular expressions to match words that:

Start and end with the same character.

string_examp <- c('goog', 'church', 'eleven')
str_view(string_examp, "^(.).*\\1$")
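
One edge case worth noting: the pattern above needs at least two characters, so a one-letter word such as "a" is not matched even though it trivially starts and ends with the same character. A possible variant, offered only as a sketch, makes the tail optional. The vector `words` and the names `strict` and `relaxed` are introduced here for illustration.

```r
library(stringr)

words <- c("goog", "church", "eleven", "a")

# Pattern from the answer above: requires a second occurrence of the first character
strict <- str_detect(words, "^(.).*\\1$")

# Variant: the "anything, then the first character again" tail is optional,
# so a single-character word also counts
relaxed <- str_detect(words, "^(.)(.*\\1)?$")

words[strict]   # "goog"
words[relaxed]  # "goog" "a"
```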

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

str_view(string_examp,"(..).*\\1")

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

str_view(string_examp,"(.).*\\1.*\\1")
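
A simple way to sanity-check this last pattern is to count characters directly and compare the two answers. The sketch below assumes plain lowercase words; `words`, `regex_hit` and `count_hit` are names introduced here for illustration.

```r
library(stringr)

words <- c("goog", "church", "eleven")

# Regex answer: some character occurs in at least three places
regex_hit <- str_detect(words, "(.).*\\1.*\\1")

# Independent check: split each word into characters and count them
count_hit <- vapply(
  strsplit(words, ""),
  function(chars) max(table(chars)) >= 3,
  logical(1)
)

words[regex_hit]               # "eleven"
identical(regex_hit, count_hit)  # TRUE when both approaches agree
```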

Khyati Naik

2022-09-17
