1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”

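# Load the all-ages table from the FiveThirtyEight College Majors dataset
majors_data <- read.csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/all-ages.csv')
# Search terms, joined below into the alternation "STATISTICS|DATA"
toMatch <- c("STATISTICS", "DATA")
matches <- unique(grep(paste(toMatch, collapse = "|"),
                       majors_data$Major, value = TRUE, ignore.case = TRUE))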
matches
## [1] "COMPUTER PROGRAMMING AND DATA PROCESSING"     
## [2] "STATISTICS AND DECISION SCIENCE"              
## [3] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
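
The alternation approach can also be checked offline on a small vector. This is a minimal sketch, assuming a hypothetical `sample_majors` vector written by hand for illustration (it is not read from the dataset file):

```r
# Hypothetical sample of major names (assumed for illustration only)
sample_majors <- c("STATISTICS AND DECISION SCIENCE",
                   "GENERAL BUSINESS",
                   "COMPUTER PROGRAMMING AND DATA PROCESSING",
                   "ART HISTORY")

# Join the search terms into a single alternation: "STATISTICS|DATA"
pattern <- paste(c("STATISTICS", "DATA"), collapse = "|")

# grepl() returns one TRUE/FALSE per element, which is used to subset the vector
sample_matches <- sample_majors[grepl(pattern, sample_majors, ignore.case = TRUE)]
sample_matches
```

Using `grepl()` for subsetting gives the same result as `grep(..., value = TRUE)`; the logical vector is handy when the match needs to be combined with other conditions.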

2. Write code that transforms the data below:

[1] “bell pepper” “bilberry” “blackberry” “blood orange”

[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”

[9] “elderberry” “lime” “lychee” “mulberry”

[13] “olive” “salal berry”

Into a format like this:

c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)

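library(stringi)  # provides stri_match_all_regex()
# Read the raw printout of the fruit vector from the text file
ab <- read.csv('https://raw.githubusercontent.com/uriahman/607_Wk_3_Hw/main/berries_peppers.txt', header = FALSE)
# Match two-word names ending in "pepper", "berry" or "orange", one-word names ending
# in "berry", and any remaining word of 5 to 10 letters. The four-letter "lime" falls
# outside that last range, so it is absent from the output.
bell <- stri_match_all_regex(ab, '[a-zA-Z]+[[:space:]]+pepper|[a-zA-Z]+[[:space:]]+berry|[a-zA-Z]+[[:space:]]+orange|[a-zA-Z]+berry|[a-zA-Z]{5,10}')
df <- data.frame(bell)
colnames(df)[1] <- 'fruits'
# Coercing the one-column data frame to character yields its c("...") text form
fruits <- as.character(df['fruits'])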
df['fruits']
##          fruits
## 1   bell pepper
## 2      bilberry
## 3    blackberry
## 4  blood orange
## 5     blueberry
## 6    cantaloupe
## 7  chili pepper
## 8    cloudberry
## 9    elderberry
## 10       lychee
## 11     mulberry
## 12        olive
## 13  salal berry
fruits
## [1] "c(\"bell pepper\", \"bilberry\", \"blackberry\", \"blood orange\", \"blueberry\", \"cantaloupe\", \"chili pepper\", \"cloudberry\", \"elderberry\", \"lychee\", \"mulberry\", \"olive\", \"salal berry\")"
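
An alternative that does not depend on word length is to extract everything between double quotes. The following is a sketch only, assuming the printout from the problem statement is pasted into a string (the name `raw_text` and the other variable names are hypothetical); it uses base R rather than stringi:

```r
# The console printout from the problem statement, kept as one string
# (assumed input; typed in by hand with straight quotes)
raw_text <- '[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"'

# Find every run of non-quote characters enclosed in double quotes
quoted <- regmatches(raw_text, gregexpr('"[^"]+"', raw_text))[[1]]

# Strip the surrounding quotes to get a plain character vector
fruit_vec <- gsub('"', '', quoted, fixed = TRUE)

# Build the c("...", "...") text representation
fruit_string <- paste0('c(', paste0('"', fruit_vec, '"', collapse = ', '), ')')
cat(fruit_string, "\n")
```

Because the pattern keys on the quotes instead of the letters, short names such as "lime" and two-word names such as "salal berry" are handled the same way.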

3. Describe, in words, what these expressions will match:

a) (.)\1\1

The parentheses create a capture group, and the dot inside it matches any single character. Each \1 is a backreference to whatever that first group captured, so the pattern requires the captured character to appear two more times immediately after it. The expression therefore matches any character repeated three times in a row, such as “ooo”.
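
A quick check of this reading, as a sketch on a hand-made vector (`test_strings` and `triple_hits` are hypothetical names; base R's `grepl()` with `perl = TRUE` is used so the example needs no packages):

```r
# Hypothetical test strings (assumed for illustration)
test_strings <- c("ooo", "aaab", "book", "abab")

# In an R string each backslash is doubled, so the regex (.)\1\1 is written "(.)\\1\\1"
triple_hits <- grepl("(.)\\1\\1", test_strings, perl = TRUE)
triple_hits
# "ooo" and "aaab" contain a tripled character; "book" has only a doubled "o"
```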

b) "(.)(.)\\2\\1"

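Each pair of parentheses around a dot is a capture group that matches one character. The backreferences \2 and \1 then repeat the two captured characters in reverse order, so the expression matches a four-character mirrored sequence such as “abba”, or the “otto” in “bottom”.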

library(stringr)  # provides str_subset() and the built-in words vector
three_b <- str_subset(words, "(.)(.)\\2\\1")
three_b
##  [1] "afternoon"   "apparent"    "arrange"     "bottom"      "brilliant"  
##  [6] "common"      "difficult"   "effect"      "follow"      "indeed"     
## [11] "letter"      "million"     "opportunity" "oppose"      "tomorrow"

c) (..)\1

This is a single group that captures two characters, followed by one repeat of that group. It matches a two-character pair that occurs twice in a row, such as “aaaa” or “abab”.
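
As a sketch of how this behaves on a few hand-picked strings (the vector `pair_strings` and the result `pair_hits` are hypothetical names, and base R's `grepl()` is assumed in place of stringr):

```r
# Hypothetical test strings (assumed for illustration)
pair_strings <- c("aaaa", "abab", "abba", "banana")

# The regex (..)\1 written as an R string
pair_hits <- grepl("(..)\\1", pair_strings, perl = TRUE)
pair_hits
# "abba" does not match: the pair "ab" is followed by "ba", not by "ab" again
# "banana" matches because it contains "anan"
```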

d) "(.).\\1.\\1"

A single group captures one character, followed by any character, then the captured character again, then any character, then the captured character once more. In other words, the same character appears three times with exactly one character between each occurrence, as in the “eleve” of “eleven”.

three_d <- str_subset(words,"(.).\\1.\\1")
three_d
## [1] "eleven"

e) "(.)(.)(.).*\\3\\2\\1"

Three groups each capture a single character, followed by zero or more of any character (.*), followed by the three captured characters in reverse order. So it matches three characters, any amount of text, and then the same three characters backwards, as in the “paragrap” of “paragraph”.

three_e <- str_subset(words,"(.)(.)(.).*\\3\\2\\1")
three_e
## [1] "paragraph"

4. Construct regular expressions to match words that:

a) Start and end with the same character.

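# Capture the first character, then require either any text ending in that character,
# or the end of the string (optionally after one repeat), which covers one-letter words
four_a <- str_subset(words, "^(.)((.*\\1$)|\\1?$)")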
four_a
##  [1] "a"          "america"    "area"       "dad"        "dead"      
##  [6] "depend"     "educate"    "else"       "encourage"  "engine"    
## [11] "europe"     "evidence"   "example"    "excuse"     "exercise"  
## [16] "expense"    "experience" "eye"        "health"     "high"      
## [21] "knock"      "level"      "local"      "nation"     "non"       
## [26] "rather"     "refer"      "remember"   "serious"    "stairs"    
## [31] "test"       "tonight"    "transport"  "treat"      "trust"     
## [36] "window"     "yesterday"
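
The two branches of this pattern can be exercised on a small vector. This is a sketch under the assumption that base R's `grepl()` with `perl = TRUE` behaves like `str_subset()` for this pattern; `edge_words` and `same_ends` are hypothetical names:

```r
# Hypothetical test words (assumed for illustration)
edge_words <- c("a", "dad", "level", "apple", "ab")

# Same pattern as four_a: first character captured, then either
# ".*\\1$" (anything ending in that character) or "\\1?$" (end of string)
same_ends <- grepl("^(.)((.*\\1$)|\\1?$)", edge_words, perl = TRUE)
edge_words[same_ends]
# The one-letter word "a" is matched by the second branch;
# "apple" and "ab" start and end with different letters
```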

b) Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

four_b <- str_subset(words, "([A-Za-z][A-Za-z]).*\\1")
four_b
##  [1] "appropriate" "church"      "condition"   "decide"      "environment"
##  [6] "london"      "paragraph"   "particular"  "photograph"  "prepare"    
## [11] "pressure"    "remember"    "represent"   "require"     "sense"      
## [16] "therefore"   "understand"  "whether"
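
One detail of this pattern is that the two occurrences of the pair cannot overlap, because the backreference starts only after the captured pair ends. A small sketch, again assuming base R's `grepl()` and hypothetical names `pair_words` and `repeated_pair`:

```r
# Hypothetical test words (assumed for illustration)
pair_words <- c("church", "banana", "apple", "aaa")

# Same pattern as four_b: a two-letter pair, any text, then the same pair again
repeated_pair <- grepl("([A-Za-z][A-Za-z]).*\\1", pair_words, perl = TRUE)
repeated_pair
# "aaa" is too short: a non-overlapping repeat of "aa" needs at least four characters
```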

c) Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

four_c <- str_subset(words, "([a-z]).*\\1.*\\1")
four_c
##  [1] "appropriate" "available"   "believe"     "between"     "business"   
##  [6] "degree"      "difference"  "discuss"     "eleven"      "environment"
## [11] "evidence"    "exercise"    "expense"     "experience"  "individual" 
## [16] "paragraph"   "receive"     "remember"    "represent"   "telephone"  
## [21] "therefore"   "tomorrow"