- Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either "DATA" or "STATISTICS"
majors <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv")
head(majors)
## FOD1P Major Major_Category
## 1 1100 GENERAL AGRICULTURE Agriculture & Natural Resources
## 2 1101 AGRICULTURE PRODUCTION AND MANAGEMENT Agriculture & Natural Resources
## 3 1102 AGRICULTURAL ECONOMICS Agriculture & Natural Resources
## 4 1103 ANIMAL SCIENCES Agriculture & Natural Resources
## 5 1104 FOOD SCIENCE Agriculture & Natural Resources
## 6 1105 PLANT SCIENCE AND AGRONOMY Agriculture & Natural Resources
#str(majors)
pattern<-'DATA|STATISTICS'
grep(pattern, majors$Major, value=TRUE,ignore.case = TRUE)
## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"
## [3] "STATISTICS AND DECISION SCIENCE"
#case is ignored because we can't always know whether the data is lower case or upper case
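If the whole row is wanted rather than just the major name, `grepl()` can be used to build a logical index. The following is a minimal sketch in base R, assuming a small stand-in data frame (`majors_sample` is hypothetical and holds only three majors, one deliberately in mixed case) instead of the downloaded file:

```r
# Hypothetical stand-in for the downloaded majors data (three rows only).
majors_sample <- data.frame(
  Major = c("GENERAL AGRICULTURE",
            "COMPUTER PROGRAMMING AND DATA PROCESSING",
            "Statistics and Decision Science"),  # mixed case on purpose
  stringsAsFactors = FALSE
)

# grepl() returns TRUE/FALSE per row; ignore.case makes the match case-insensitive.
is_match <- grepl("DATA|STATISTICS", majors_sample$Major, ignore.case = TRUE)

# Keep the matching rows as a data frame.
matched <- majors_sample[is_match, , drop = FALSE]
print(matched)
```

Indexing with the logical vector keeps every column of the matching rows, which is handy when the category of each major is needed as well.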
- Write code that transforms the data below:
[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"
Into a format like this:
c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
fruitsAndVeggies <- ' "bell pepper" "bilberry" "blackberry" "blood orange" "blueberry" "cantaloupe" "chili pepper" "cloudberry" "elderberry" "lime" "lychee" "mulberry" "olive" "salal berry"'
fruitsAndVeggies
## [1] " \"bell pepper\" \"bilberry\" \"blackberry\" \"blood orange\" \"blueberry\" \"cantaloupe\" \"chili pepper\" \"cloudberry\" \"elderberry\" \"lime\" \"lychee\" \"mulberry\" \"olive\" \"salal berry\""
#new_df<-str_extract_all(fruitsAndVeggies, "\\b[a-z]+\\b")
#new_df
#To understand how str_c works, imagine that you are building up a matrix of strings. Each input argument forms a column, and is expanded to the length of the longest argument, using the usual recycling rules.
#new_df[0:1]
library(stringr) #provides str_extract_all
extracted <- str_extract_all(fruitsAndVeggies, "\\w[a-z]+\\s?[a-z]+\\w")
print(unlist(extracted))
## [1] "bell pepper" "bilberry" "blackberry" "blood orange" "blueberry"
## [6] "cantaloupe" "chili pepper" "cloudberry" "elderberry" "lime"
## [11] "lychee" "mulberry" "olive" "salal berry"
class(unlist(extracted))
## [1] "character"
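The exercise asks for the literal `c("...", "...")` text, not only a character vector. One possible way to produce it is sketched below in base R instead of stringr; the names `fruits_raw`, `quoted`, `items`, and `as_code` are illustrative and not from the original answer.

```r
# The raw text: quoted items separated by spaces.
fruits_raw <- '"bell pepper" "bilberry" "blackberry" "blood orange" "blueberry" "cantaloupe" "chili pepper" "cloudberry" "elderberry" "lime" "lychee" "mulberry" "olive" "salal berry"'

# Pull out every double-quoted item, then drop the surrounding quotes.
quoted <- regmatches(fruits_raw, gregexpr('"[^"]+"', fruits_raw))[[1]]
items <- gsub('"', "", quoted, fixed = TRUE)

# Rebuild the c("...", "...") text requested in the exercise.
as_code <- paste0("c(", paste0('"', items, '"', collapse = ", "), ")")
cat(as_code, "\n")
```

Calling `dput(items)` is a shorter route to the same kind of output, since it prints a vector as the R code that would recreate it.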
- Describe, in words, what these expressions will match:
- (.)\1\1 : The same character three times in a row (e.g. "aaa").
- "(.)(.)\\2\\1" : A pair of characters followed immediately by the same pair in reverse order (e.g. "abba").
- (..)\1 : Any two characters repeated immediately in the same order (e.g. "abab").
- "(.).\\1.\\1" : A character that appears three times, with any single character between each appearance (e.g. "abaca").
- "(.)(.)(.).*\\3\\2\\1" : Any three characters, then zero or more characters of any kind, then the same three characters in reverse order (e.g. "abcxyzcba").
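The descriptions above can be checked against small example strings. This is a sketch in base R, assuming the first and third expressions are `(.)\1\1` and `(..)\1` and that every backslash is doubled inside an R string; the pattern labels and test strings are made up for illustration.

```r
# Backreference patterns written as R strings, so each backslash is doubled.
patterns <- c(
  triple        = "(.)\\1\\1",
  mirror_pair   = "(.)(.)\\2\\1",
  repeated_pair = "(..)\\1",
  alternating   = "(.).\\1.\\1",
  mirror_triple = "(.)(.)(.).*\\3\\2\\1"
)

# One string that should match each pattern, in the same order.
should_match <- c("aaa", "abba", "abab", "abaca", "abcxyzcba")
# A string with no repeated characters should match none of them.
no_match <- rep("abcdef", length(patterns))

matches_pattern <- function(pattern, text) grepl(pattern, text, perl = TRUE)
hits <- mapply(matches_pattern, patterns, should_match)
misses <- mapply(matches_pattern, patterns, no_match)
print(hits)
print(misses)
```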
- Construct regular expressions to match words that:
- Start and end with the same character.
^([a-z]).*\1$
- Contain a repeated pair of letters (e.g. "church" contains "ch" repeated twice.)
([a-zA-Z][a-zA-Z]).*\1
- Contain one letter repeated in at least three places (e.g. "eleven" contains three "e"s.)
([a-zA-Z]).*\1.*\1
#Testing: (I ran out of time to do this better)
#In an R string the backreference must be written "\\1"; a single "\1" is read as a control character and matches nothing
mylist<-c("anna", "aann", "church", "eleven")
output_1 <- str_extract_all(mylist, "^([a-z]).*\\1$")
print(output_1)
## [[1]]
## [1] "anna"
##
## [[2]]
## character(0)
##
## [[3]]
## character(0)
##
## [[4]]
## character(0)
output_2<-str_extract_all(mylist, "([a-zA-Z][a-zA-Z]).*\\1")
print(output_2)
## [[1]]
## character(0)
##
## [[2]]
## character(0)
##
## [[3]]
## [1] "church"
##
## [[4]]
## character(0)
#str_extract_all returns only the matched part, so "eleven" shows up as "eleve"
output_3<-str_extract_all(mylist, "([a-zA-Z]).*\\1.*\\1")
print(output_3)
## [[1]]
## character(0)
##
## [[2]]
## character(0)
##
## [[3]]
## character(0)
##
## [[4]]
## [1] "eleve"
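A more compact check is to keep, for each pattern, only the words that match it. The sketch below uses base R `grepl()` with `perl = TRUE` rather than stringr, and writes each backreference as `\\1` because a single backslash followed by a digit in an R string is an escape sequence, not a backreference. The names `words`, `patterns`, and `matches` are illustrative.

```r
words <- c("anna", "aann", "church", "eleven")

# The three patterns from the exercise, with backslashes doubled for R strings.
patterns <- c(
  same_ends     = "^([a-z]).*\\1$",
  repeated_pair = "([a-zA-Z][a-zA-Z]).*\\1",
  triple_letter = "([a-zA-Z]).*\\1.*\\1"
)

# For each pattern, return the whole words that contain a match.
matches <- lapply(patterns, function(p) words[grepl(p, words, perl = TRUE)])
print(matches)
```

Returning whole words instead of the matched substring makes it easier to see which test word each pattern accepts.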