1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either "DATA" or "STATISTICS"
majors <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv")
head(majors)
##   FOD1P                                 Major                  Major_Category
## 1  1100                   GENERAL AGRICULTURE Agriculture & Natural Resources
## 2  1101 AGRICULTURE PRODUCTION AND MANAGEMENT Agriculture & Natural Resources
## 3  1102                AGRICULTURAL ECONOMICS Agriculture & Natural Resources
## 4  1103                       ANIMAL SCIENCES Agriculture & Natural Resources
## 5  1104                          FOOD SCIENCE Agriculture & Natural Resources
## 6  1105            PLANT SCIENCE AND AGRONOMY Agriculture & Natural Resources
#str(majors)
pattern<-'DATA|STATISTICS'
grep(pattern, majors$Major, value=TRUE,ignore.case = TRUE)
## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"     
## [3] "STATISTICS AND DECISION SCIENCE"
# Case is ignored because we can't always know whether the data is lower case or upper case.
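The same pattern can also act as a row filter when the whole record is wanted rather than just the major name. The sketch below is a minimal illustration in base R; `toy_majors` is a hypothetical three-row stand-in for the downloaded `majors` data frame, built from names that appear in the output above.

```r
# Hypothetical stand-in for the downloaded data (names copied from the output above)
toy_majors <- data.frame(
  Major = c("GENERAL AGRICULTURE",
            "COMPUTER PROGRAMMING AND DATA PROCESSING",
            "STATISTICS AND DECISION SCIENCE"),
  stringsAsFactors = FALSE
)

pattern <- "DATA|STATISTICS"

# grepl() returns one TRUE/FALSE per row, so it can index the data frame directly
hits <- toy_majors[grepl(pattern, toy_majors$Major, ignore.case = TRUE), , drop = FALSE]
print(hits$Major)
```

Using `grepl()` instead of `grep(..., value = TRUE)` keeps any other columns attached to the matching rows.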
  1. Write code that transforms the data below:
     [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
     [5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"
     [9] "elderberry"   "lime"         "lychee"       "mulberry"
     [13] "olive"        "salal berry"
     Into a format like this:
     c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
fruitsAndVeggies <- ' "bell pepper" "bilberry"  "blackberry" "blood orange" "blueberry"    "cantaloupe"   "chili pepper" "cloudberry" "elderberry"  "lime" "lychee"  "mulberry"  "olive"  "salal berry"'
fruitsAndVeggies
## [1] " \"bell pepper\" \"bilberry\"  \"blackberry\" \"blood orange\" \"blueberry\"    \"cantaloupe\"   \"chili pepper\" \"cloudberry\" \"elderberry\"  \"lime\" \"lychee\"  \"mulberry\"  \"olive\"  \"salal berry\""
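library(stringr)
# Earlier attempt, kept for reference: "\\b[a-z]+\\b" splits two-word names such as "bell pepper".
#new_df<-str_extract_all(fruitsAndVeggies, "\\b[a-z]+\\b")
# To understand how str_c works, you need to imagine that you are building up a matrix of strings. Each input argument forms a column, and is expanded to the length of the longest argument, using the usual recycling rules.

# Match runs of lowercase letters with at most one internal whitespace character
# (the pattern needs at least four characters in total, which every name here satisfies).
extracted <- str_extract_all(fruitsAndVeggies, "\\w[a-z]+\\s?[a-z]+\\w")
print(unlist(extracted))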
##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"
class(unlist(extracted))
## [1] "character"
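The exercise asks for output in the literal `c("...", "...")` form, whereas the extraction above yields a character vector. One possible way to produce that text is sketched below with base R `paste0()`; the short `fruits` vector is a hypothetical sample, not the full list.

```r
# Hypothetical short sample standing in for unlist(extracted)
fruits <- c("bell pepper", "bilberry", "blackberry")

# Wrap each element in double quotes, join with ", ", then wrap the result in c( )
formatted <- paste0("c(", paste0('"', fruits, '"', collapse = ", "), ")")
cat(formatted, "\n")
```

Base R's `dput()` prints a character vector in the same form, so `dput(fruits)` is a shorter alternative when the text only needs to be displayed.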
  1. Describe, in words, what these expressions will match:
  1. Construct regular expressions to match words that:
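# Testing (I ran out of time to do this more thoroughly).
# A backreference must be written "\\1" in an R string; a bare "\1" is an octal escape, so the pattern never matches.
mylist<-list("anna", "aann", "church", "eleven")
# Words that start and end with the same character
output_1 <- str_extract_all(mylist, "^([a-z]).*\\1$")
print(output_1)
## [[1]]
## [1] "anna"
## 
## [[2]]
## character(0)
## 
## [[3]]
## character(0)
## 
## [[4]]
## character(0)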
# Words that contain a repeated pair of letters ("\\1" refers back to the captured pair)
output_2<-str_extract_all(mylist, "([a-zA-Z][a-zA-Z]).*\\1")
print(output_2)
## [[1]]
## character(0)
## 
## [[2]]
## character(0)
## 
## [[3]]
## [1] "church"
## 
## [[4]]
## character(0)
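# Words that contain one letter repeated in at least three places
# (the match runs from the first to the last occurrence of that letter)
output_3<-str_extract_all(mylist, "([a-zA-Z]).*\\1.*\\1")
print(output_3)
## [[1]]
## character(0)
## 
## [[2]]
## character(0)
## 
## [[3]]
## character(0)
## 
## [[4]]
## [1] "eleve"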
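Backreferences in these patterns depend on how R parses string literals: `"\1"` is read as a single control character, while `"\\1"` reaches the regex engine as a backslash followed by `1`. The sketch below checks this with base R only (`grepl()` with `perl = TRUE` rather than stringr, which is a swapped-in choice to keep the example dependency-free); the variable names are illustrative.

```r
# "\1" is one control character (octal escape); "\\1" is two characters: backslash and 1
n_single <- nchar("\1")
n_double <- nchar("\\1")

words <- c("anna", "aann", "church", "eleven")

# Starts and ends with the same letter
same_ends <- grepl("^([a-z]).*\\1$", words, perl = TRUE)
# Contains a repeated pair of letters
repeated_pair <- grepl("([a-z][a-z]).*\\1", words, perl = TRUE)
# Contains one letter in at least three places
triple_letter <- grepl("([a-z]).*\\1.*\\1", words, perl = TRUE)

print(data.frame(words, same_ends, repeated_pair, triple_letter))
```

Returning logical vectors with `grepl()` makes it easy to see which test words satisfy each rule at a glance, which is handy when checking a pattern against known positives and negatives.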