A good ol’ string manipulation hootenanny using R. Also, ŗ̵̄͒̌̽͆̓͊͛e̶͇̬̞͙̩͚̖͆̂̆̒̕̚̕̚g̷͓̭̘̪̦͍̓͗̎̋̽̈́̕͜e̵̫̼̞̮͙͆̍̇̾͝x̴͇͙͓̟̠̍ shenanigans.
Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”.
Alright, well, let’s start with getting the data.
library(tidyverse)  # assumed setup: loads readr (read_csv) and stringr (str_* functions)
majors = read_csv(url("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"))
## Rows: 174 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): FOD1P, Major, Major_Category
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(majors)
## # A tibble: 6 × 3
##   FOD1P Major                                 Major_Category
##   <chr> <chr>                                 <chr>
## 1 1100  GENERAL AGRICULTURE                   Agriculture & Natural Resources
## 2 1101  AGRICULTURE PRODUCTION AND MANAGEMENT Agriculture & Natural Resources
## 3 1102  AGRICULTURAL ECONOMICS                Agriculture & Natural Resources
## 4 1103  ANIMAL SCIENCES                       Agriculture & Natural Resources
## 5 1104  FOOD SCIENCE                          Agriculture & Natural Resources
## 6 1105  PLANT SCIENCE AND AGRONOMY            Agriculture & Natural Resources
Now we get the majors that contain the substring DATA or STATISTICS. There are lots of ways to do this, both in base R (grepl(), for example) and with packages, such as str_detect() from stringr or %like% from data.table.
## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"
## [3] "STATISTICS AND DECISION SCIENCE"
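For reference, one way to get a list like the one above is a single alternation pattern. This is only a sketch: it uses base R's grepl() on a small hand-typed stand-in vector (sampleMajors and hits are made-up names, and the rows are copied from the output shown in this post) so that it runs without downloading anything.

```r
# Stand-in for majors$Major: a few rows typed by hand for illustration.
sampleMajors <- c(
  "GENERAL AGRICULTURE",
  "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
  "COMPUTER PROGRAMMING AND DATA PROCESSING",
  "STATISTICS AND DECISION SCIENCE",
  "FOOD SCIENCE"
)

# "DATA|STATISTICS" matches either substring anywhere in the string.
hits <- sampleMajors[grepl("DATA|STATISTICS", sampleMajors)]
print(hits)
```

With the real data the same pattern would be applied to majors$Major, and stringr's str_subset(majors$Major, "DATA|STATISTICS") should give the same result.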
Write code that transforms the data below:
[1] “bell pepper” “bilberry” “blackberry” “blood orange”
[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”
[9] “elderberry” “lime” “lychee” “mulberry”
[13] “olive” “salal berry”
Into a format like this:
c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)
My interpretation of this question is to transform a string in the former format into a string in the latter. We’ll need to strip out the index marker at the start of each line, put everything on one line, trim the spaces, add the commas, wrap it all in parentheses, and finally prefix it with c.
origFruit <- r"--([1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry")--"
str_view(origFruit)
## [1] │ [1] "bell pepper" "bilberry" "blackberry" "blood orange"
##     │ [5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
##     │ [9] "elderberry" "lime" "lychee" "mulberry"
##     │ [13] "olive" "salal berry"
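# Strip the [n] index markers; \\d+ is stricter than a greedy .* between the brackets
newFruit <- str_remove_all(origFruit, "\\[\\d+\\]")
# Collapse newlines and runs of spaces into a single space (\\s also matches \n)
newFruit <- str_replace_all(newFruit, "\\s+", " ")
# Insert a comma between adjacent quoted items
newFruit <- str_replace_all(newFruit, "\" \"", "\", \"")
# Drop the leading space left behind by the first index marker
newFruit <- trimws(newFruit)
# Wrap everything in c( ... )
newFruit <- paste0("c(", newFruit, ")")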
str_view(newFruit)
## [1] │ c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
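An alternative sketch, not the approach used above: rather than stripping away everything that isn’t wanted, pull out the quoted items and rebuild the string around them. It uses only base R (regmatches() and gregexpr()), and the names items, rebuilt and fruitVec are made up for this example. The last two lines are a sanity check that the resulting string really parses as an R vector.

```r
origFruit <- r"--([1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry")--"

# Extract every double-quoted run, quotes included.
items <- regmatches(origFruit, gregexpr('"[^"]+"', origFruit))[[1]]

# Join the items with ", " and wrap the result in c( ... ).
rebuilt <- paste0("c(", paste(items, collapse = ", "), ")")
cat(rebuilt, "\n")

# Sanity check: evaluating the string as R code should give 14 fruit names.
fruitVec <- eval(parse(text = rebuilt))
length(fruitVec)
```

The trade-off is that this version ignores whatever sits between the quoted items, so it does not care how the index markers or line breaks are laid out.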
Describe, in words, what these expressions will match:
(.)\1\1
Character repeated three times consecutively anywhere in a string.
e.g. aaa
“(.)(.)\\2\\1”
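Read literally, quotes included: a double quote, any two characters, a literal \2, a literal \1, and a closing double quote. The doubled backslashes match literal backslashes, so nothing here acts as a backreference.
e.g. “21\2\1”, “c”\2\1”, etc.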
(..)\1
Two characters followed by the same two characters.
e.g. abab
“(.).\\1.\\1”
Read literally: a double quote, any two characters, a literal \1, any character, another literal \1, and a closing double quote.
e.g. “ab\1a\1”
“(.)(.)(.).*\\3\\2\\1”
Read literally: a double quote, three characters, any number of further characters, then a literal \3, a literal \2 and a literal \1, and a closing double quote.
e.g. “abcDEFQRESXYZ\3\2\1”
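The quoted patterns above are read literally, quotes and doubled backslashes included. As a rough sketch of how that differs from typing the same text as an R string (where \\2 reaches the regex engine as the backreference \2), here is a comparison using base R's grepl() with perl = TRUE. The variable names are invented for the example, and raw strings need R 4.0 or later.

```r
# Literal reading: the regex text is exactly  "(.)(.)\\2\\1"  with the quotes.
# A raw string keeps every character as typed.
literalPattern <- r"--("(.)(.)\\2\\1")--"

# R-string reading: R turns \\2 into \2, so the engine sees  (.)(.)\2\1
stringPattern <- "(.)(.)\\2\\1"

# The characters  "21\2\1"  including the surrounding quotes.
literalSubject <- r"--("21\2\1")--"

literalHit  <- grepl(literalPattern, literalSubject, perl = TRUE)  # TRUE
abbaLiteral <- grepl(literalPattern, "abba", perl = TRUE)          # FALSE
abbaString  <- grepl(stringPattern, "abba", perl = TRUE)           # TRUE
print(c(literalHit, abbaLiteral, abbaString))
```

Under the R-string reading the pattern matches two characters followed by the same two in reverse order, which is why "abba" matches there and not under the literal reading.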
Construct regular expressions to match words that:
Start and end with the same character.
^(.).*\1$
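The first character is captured, anything may sit in between, and \1 requires the same character at the end. As written it needs at least two characters, so a one-letter word is not matched.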
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
(..).*\1
Very similar to the first regex, but here the captured group is two characters instead of one, and the match isn’t forced to the first and last characters; it can sit anywhere in the string.
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)
(.).*\1.*\1
Similar to the previous one, but with one captured character instead of two, and the backreference appears twice rather than once, so the character occurs at least three times.
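A quick way to check the three patterns is to run them over a handful of words. This is a sketch using base R's grepl() with perl = TRUE; the word list and variable names are made up for illustration. Inside an R string the backslash has to be doubled, so \1 is typed as "\\1".

```r
words <- c("abba", "church", "eleven", "banana", "stats", "a", "regex")

# Start and end with the same character.
# Note: "a" is not matched because the pattern needs at least two characters.
sameEnds <- grepl("^(.).*\\1$", words, perl = TRUE)

# Contain a repeated pair of characters.
repeatedPair <- grepl("(..).*\\1", words, perl = TRUE)

# Contain one character in at least three places.
threeTimes <- grepl("(.).*\\1.*\\1", words, perl = TRUE)

print(words[sameEnds])      # "abba" "stats"
print(words[repeatedPair])  # "church" "banana"
print(words[threeTimes])    # "eleven" "banana"
```

Because . matches any character, these patterns would also accept digits or punctuation; swapping . for [a-z] would be one way to restrict them to letters.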